Computer Vision and Pattern Recognition 100
☆ LiDAR-Event Stereo Fusion with Hallucinations ECCV 2024
Event stereo matching is an emerging technique to estimate depth from
neuromorphic cameras; however, events are unlikely to trigger in the absence of
motion or the presence of large, untextured regions, making the correspondence
problem extremely challenging. To address this, we propose integrating a stereo event
camera with a fixed-frequency active sensor -- e.g., a LiDAR -- collecting
sparse depth measurements, overcoming the aforementioned limitations. Such
depth hints are used to hallucinate -- i.e., insert fictitious events into --
the stacks or raw input streams, compensating for the lack of information in
the absence of brightness changes. Our techniques are general, can be adapted
to any structured event representation, and outperform state-of-the-art fusion
methods applied to event-based stereo.
comment: ECCV 2024. Code: https://github.com/bartn8/eventvppstereo/ - Project
Page: https://eventvppstereo.github.io/
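The hallucination step described above, inserting fictitious events at pixels where a LiDAR hint is available, can be sketched as follows. The voxel-grid stack layout, the rounding of disparities, and the constant event magnitude are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def hallucinate_events(stack_l, stack_r, disparity, magnitude=1.0):
    """Insert fictitious events into left/right event stacks at pixels
    where a sparse disparity hint (e.g., from LiDAR) is available.

    stack_l, stack_r : (bins, H, W) event voxel grids
    disparity        : (H, W) sparse map, 0 where no hint is available
    """
    bins, H, W = stack_l.shape
    ys, xs = np.nonzero(disparity)
    for y, x in zip(ys, xs):
        d = int(round(disparity[y, x]))
        xr = x - d                          # corresponding right-view column
        if 0 <= xr < W:
            stack_l[:, y, x] += magnitude   # write the same fictitious pattern
            stack_r[:, y, xr] += magnitude  # in both views so they match
    return stack_l, stack_r
```

Because the same fictitious pattern is written at corresponding left/right coordinates, a stereo matcher finds support at those pixels even where no real events fired.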
☆ Arctic-TILT. Business Document Understanding at Sub-Billion Scale
Łukasz Borchmann, Michał Pietruszka, Wojciech Jaśkowski, Dawid Jurkiewicz, Piotr Halama, Paweł Józiak, Łukasz Garncarek, Paweł Liskowski, Karolina Szyndler, Andrzej Gretkowski, Julita Ołtusek, Gabriela Nowakowska, Artur Zawłocki, Łukasz Duhr, Paweł Dyda, Michał Turski
A vast portion of workloads employing LLMs involves answering questions
grounded in PDF or scan content. We introduce Arctic-TILT, which achieves
accuracy on par with models 1000$\times$ its size on these use cases. It can be
fine-tuned and deployed on a single 24GB GPU, lowering operational costs while
processing Visually Rich Documents with up to 400k tokens. The model
establishes state-of-the-art results on seven diverse Document Understanding
benchmarks, as well as provides reliable confidence scores and quick inference,
which are essential for processing files in large-scale or time-sensitive
enterprise environments.
☆ Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics
We present Puppet-Master, an interactive video generative model that can
serve as a motion prior for part-level dynamics. At test time, given a single
image and a sparse set of motion trajectories (i.e., drags), Puppet-Master can
synthesize a video depicting realistic part-level motion faithful to the given
drag interactions. This is achieved by fine-tuning a large-scale pre-trained
video diffusion model, for which we propose a new conditioning architecture to
inject the dragging control effectively. More importantly, we introduce the
all-to-first attention mechanism, a drop-in replacement for the widely adopted
spatial attention modules, which significantly improves generation quality by
addressing the appearance and background issues in existing models. Unlike
other motion-conditioned video generators that are trained on in-the-wild
videos and mostly move an entire object, Puppet-Master is learned from
Objaverse-Animation-HQ, a new dataset of curated part-level motion clips. We
propose a strategy to automatically filter out sub-optimal animations and
augment the synthetic renderings with meaningful motion trajectories.
Puppet-Master generalizes well to real images across various categories and
outperforms existing methods in a zero-shot manner on a real-world benchmark.
See our project page for more results: vgg-puppetmaster.github.io.
comment: Project page: https://vgg-puppetmaster.github.io/
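The all-to-first attention above can be pictured as standard scaled dot-product attention in which every frame's queries attend to the keys and values of the first frame only. A minimal NumPy sketch, with single-head attention and no learned projections (both simplifications):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def all_to_first_attention(q, k, v):
    """q, k, v: (frames, tokens, dim). Every frame's queries attend to the
    keys/values of frame 0 only, instead of frame-local spatial attention."""
    d = q.shape[-1]
    k0, v0 = k[0], v[0]                    # (tokens, dim) of the first frame
    scores = q @ k0.T / np.sqrt(d)         # (frames, tokens, tokens)
    return softmax(scores, axis=-1) @ v0   # (frames, tokens, dim)
```

Tying every frame's values to frame 0 is one way to read the claimed benefit: later frames cannot drift away from the reference appearance and background.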
☆ LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP
Standard natural language processing (NLP) pipelines operate on symbolic
representations of language, which typically consist of sequences of discrete
tokens. However, creating an analogous representation for ancient logographic
writing systems is an extremely labor intensive process that requires expert
knowledge. At present, a large portion of logographic data persists in a purely
visual form due to the absence of transcription -- this issue poses a
bottleneck for researchers seeking to apply NLP toolkits to study ancient
logographic languages: most of the relevant data are images of writing.
This paper investigates whether direct processing of visual representations
of language offers a potential solution. We introduce LogogramNLP, the first
benchmark enabling NLP analysis of ancient logographic languages, featuring
both transcribed and visual datasets for four writing systems along with
annotations for tasks like classification, translation, and parsing. Our
experiments compare systems that employ recent visual and text encoding
strategies as backbones. The results demonstrate that visual representations
outperform textual representations for some investigated tasks, suggesting that
visual processing pipelines may unlock a large amount of cultural heritage data
of logographic languages for NLP-based analyses.
☆ Quantifying the Impact of Population Shift Across Age and Sex for Abdominal Organ Segmentation MICCAI 2024
Deep learning-based medical image segmentation has seen tremendous progress
over the last decade, but there is still relatively little transfer into
clinical practice. One of the main barriers is the challenge of domain
generalisation, which requires segmentation models to maintain high performance
across a wide distribution of image data. This challenge is amplified by the
many factors that contribute to the diverse appearance of medical images, such
as acquisition conditions and patient characteristics. The impact of shifting
patient characteristics such as age and sex on segmentation performance remains
relatively under-studied, especially for abdominal organs, despite being
crucial for ensuring the fairness of segmentation models. We perform the
first study to determine the impact of population shift with respect to age and
sex on abdominal CT image segmentation, by leveraging two large public
datasets, and introduce a novel metric to quantify the impact. We find that
population shift is a challenge similar in magnitude to cross-dataset shift for
abdominal organ segmentation, and that the effect is asymmetric and
dataset-dependent. We conclude that dataset diversity in terms of known patient
characteristics is not necessarily equivalent to dataset diversity in terms of
image features. This implies that simple population matching to ensure good
generalisation and fairness may be insufficient, and we recommend that fairness
research should be directed towards better understanding and quantifying
medical image dataset diversity in terms of performance-relevant
characteristics such as organ morphology.
comment: This paper has been accepted for publication by the MICCAI 2024
Fairness of AI in Medical Imaging (FAIMI) Workshop
☆ Enhanced Prototypical Part Network (EPPNet) For Explainable Image Classification Via Prototypes ICIP
Explainable Artificial Intelligence (xAI) has the potential to enhance the
transparency and trust of AI-based systems. Although accurate predictions can
be made using Deep Neural Networks (DNNs), the process used to arrive at such
predictions is usually hard to explain. In terms of perceptibly human-friendly
representations, such as word phrases in text or super-pixels in images,
prototype-based explanations can justify a model's decision. In this work, we
introduce a DNN architecture for image classification, the Enhanced
Prototypical Part Network (EPPNet), which achieves strong performance while
discovering relevant prototypes that can be used to explain the classification
results. This is achieved by introducing a novel cluster loss that helps to
discover more relevant human-understandable prototypes. We also introduce a
faithfulness score to evaluate the explainability of the results based on the
discovered prototypes. Our score not only accounts for the relevance of the
learned prototypes but also the performance of a model. Our evaluations on the
CUB-200-2011 dataset show that the EPPNet outperforms state-of-the-art
xAI-based methods, in terms of both classification accuracy and explainability.
comment: Accepted at the International Conference on Image Processing (ICIP),
IEEE (2024); we will upload the new version after it is published by IEEE
☆ Fall Detection for Industrial Setups Using YOLOv8 Variants
This paper presents the development of an industrial fall detection system
utilizing YOLOv8 variants, enhanced by our proposed augmentation pipeline to
increase dataset variance and improve detection accuracy. Among the models
evaluated, the YOLOv8m model, consisting of 25.9 million parameters and 79.1
GFLOPs, demonstrated a respectable balance between computational efficiency and
detection performance, achieving a mean Average Precision (mAP) of 0.971 at 50%
Intersection over Union (IoU) across both "Fall Detected" and "Human in Motion"
categories. Although the YOLOv8l and YOLOv8x models presented higher precision
and recall, particularly in fall detection, their higher computational demands
and model size make them less suitable for resource-constrained environments.
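The reported mAP at 50% IoU rests on the standard box-overlap test: a detection counts as correct when its Intersection over Union with a ground-truth box reaches 0.5. A minimal sketch of that criterion, assuming corner-format `(x1, y1, x2, y2)` boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(pred, gt, thresh=0.5):
    """mAP@50 counts a detection as correct when IoU >= 0.5."""
    return iou(pred, gt) >= thresh
```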
☆ Towards High-resolution 3D Anomaly Detection via Group-Level Feature Contrastive Learning
High-resolution point cloud (HRPCD) anomaly detection (AD) plays a critical
role in precision machining and high-end equipment manufacturing. Although
many 3D-AD methods have been proposed recently, they still cannot
meet the requirements of the HRPCD-AD task. There are several challenges: i) It
is difficult to directly capture HRPCD information due to large amounts of
points at the sample level; ii) The advanced transformer-based methods usually
obtain anisotropic features, leading to degradation of the representation; iii)
The proportion of abnormal areas is very small, which makes it difficult to
characterize. To address these challenges, we propose a novel group-level
feature-based network, called Group3AD, which has highly efficient
representation ability. First, we design an Intercluster Uniformity
Network (IUN) to represent different groups in the feature space as several
clusters, and obtain a more uniform distribution between clusters
representing different parts of the point clouds in the feature space. Then, an
Intracluster Alignment Network (IAN) is designed to encourage groups within the
cluster to be distributed tightly in the feature space. In addition, we propose
an Adaptive Group-Center Selection (AGCS) based on geometric information to
improve the pixel density of potential anomalous regions during inference. The
experimental results verify the effectiveness of our proposed Group3AD, which
surpasses Reg3D-AD by a margin of 5% in terms of object-level AUROC on
Real3D-AD. We provide the code and supplementary information on our website:
https://github.com/M-3LAB/Group3AD.
comment: ACMMM24, 12 pages, 5 figures
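The abstract does not give the IUN/IAN loss formulas, but inter-cluster uniformity and intra-cluster alignment can be sketched in the spirit of the well-known alignment/uniformity contrastive objectives; the exact forms below are illustrative assumptions, not the paper's losses:

```python
import numpy as np

def alignment_loss(features, labels):
    """Mean squared distance between features and their cluster centroid:
    encourages groups within a cluster to sit tightly together."""
    classes = np.unique(labels)
    loss = 0.0
    for c in classes:
        fc = features[labels == c]
        loss += ((fc - fc.mean(axis=0)) ** 2).sum(axis=1).mean()
    return loss / len(classes)

def uniformity_loss(centroids, t=2.0):
    """Log of the mean Gaussian potential between cluster centroids:
    lower when centroids spread out across the feature space."""
    n = len(centroids)
    d2 = ((centroids[:, None] - centroids[None]) ** 2).sum(-1)
    mask = ~np.eye(n, dtype=bool)
    return np.log(np.exp(-t * d2[mask]).mean())
```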
☆ Improving Network Interpretability via Explanation Consistency Evaluation
While deep neural networks have achieved remarkable performance, they tend to
lack transparency in prediction. The pursuit of greater interpretability in
neural networks often results in a degradation of their original performance.
Some works strive to improve both interpretability and performance, but they
primarily depend on meticulously imposed conditions. In this paper, we propose
a simple yet effective framework that acquires more explainable activation
heatmaps and simultaneously increases model performance, without the need
for any extra supervision. Specifically, our concise framework introduces a new
metric, i.e., explanation consistency, to reweight the training samples
adaptively in model learning. The explanation consistency metric is utilized to
measure the similarity between the model's visual explanations of the original
samples and those of semantic-preserved adversarial samples, whose background
regions are perturbed by using image adversarial attack techniques. Our
framework then promotes the model learning by paying closer attention to those
training samples with a high difference in explanations (i.e., low explanation
consistency), for which the current model cannot provide robust
interpretations. Comprehensive experimental results on various benchmarks
demonstrate the superiority of our framework in multiple aspects, including
higher recognition accuracy, greater data debiasing capability, stronger
network robustness, and more precise localization ability on both regular
networks and interpretable networks. We also provide extensive ablation studies
and qualitative analyses to unveil the detailed contribution of each component.
comment: To appear in IEEE Transactions on Multimedia
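A minimal sketch of the reweighting idea above, assuming cosine similarity as the heatmap-similarity measure and a simple `1 - consistency` weight (both are illustrative choices, not necessarily the paper's exact definitions):

```python
import numpy as np

def explanation_consistency(heatmap_orig, heatmap_adv, eps=1e-8):
    """Cosine similarity between the flattened explanation heatmaps of an
    original sample and its background-perturbed adversarial counterpart."""
    a, b = heatmap_orig.ravel(), heatmap_adv.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def sample_weight(consistency):
    """Low consistency -> high training weight, focusing learning on samples
    the model cannot explain robustly."""
    return 1.0 - consistency
```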
☆ Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
High-performance Multimodal Large Language Models (MLLMs) rely heavily on
data quality. This study introduces a novel dataset named Img-Diff, designed to
enhance fine-grained image recognition in MLLMs by leveraging insights from
contrastive learning and image difference captioning. By analyzing object
differences between similar images, we challenge models to identify both
matching and distinct components. We utilize the Stable-Diffusion-XL model and
advanced image editing techniques to create pairs of similar images that
highlight object replacements. Our methodology includes a Difference Area
Generator to identify object differences, followed by a Difference Captions
Generator to produce detailed difference descriptions. The result is a
relatively small but high-quality dataset of "object replacement" samples. We
use the proposed dataset to fine-tune state-of-the-art (SOTA) MLLMs such as MGM-7B,
yielding comprehensive performance improvements over SOTA models trained
with larger-scale datasets, in numerous image difference and Visual
Question Answering tasks. For instance, our trained models notably surpass the
SOTA models GPT-4V and Gemini on the MMVP benchmark. Besides, we investigate
alternative methods for generating image difference data through "object
removal" and conduct thorough evaluation to confirm the dataset's diversity,
quality, and robustness, presenting several insights on synthesis of such
contrastive dataset. To encourage further research and advance the field of
multimodal data synthesis and enhancement of MLLMs' fundamental capabilities
for image understanding, we release our codes and dataset at
https://github.com/modelscope/data-juicer/tree/ImgDiff.
comment: 14 pages, 9 figures, 7 tables
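As a toy stand-in for the Difference Area Generator, a thresholded per-pixel absolute difference already localizes "object replacement" regions between a pair of similar images; the real pipeline is more involved, so treat this only as an illustration of the idea:

```python
import numpy as np

def difference_areas(img_a, img_b, thresh=0.1):
    """Binary mask of regions where two similar images differ: per-pixel
    absolute difference (averaged over channels) thresholded into a mask."""
    diff = np.abs(img_a.astype(float) - img_b.astype(float))
    if diff.ndim == 3:                 # color image: reduce over channels
        diff = diff.mean(axis=-1)
    return diff > thresh
```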
☆ SAM 2 in Robotic Surgery: An Empirical Evaluation for Robustness and Generalization in Surgical Video Segmentation
The recent Segment Anything Model (SAM) 2 has demonstrated remarkable
foundational competence in semantic segmentation, with its memory mechanism and
mask decoder further addressing challenges in video tracking and object
occlusion, thereby achieving superior results in interactive segmentation for
both images and videos. Building upon our previous empirical studies, we
further explore the zero-shot segmentation performance of SAM 2 in
robot-assisted surgery based on prompts, alongside its robustness against
real-world corruption. For static images, we employ two forms of prompts:
1-point and bounding box, while for video sequences, the 1-point prompt is
applied to the initial frame. Through extensive experimentation on the MICCAI
EndoVis 2017 and EndoVis 2018 benchmarks, SAM 2, when utilizing bounding box
prompts, outperforms state-of-the-art (SOTA) methods in comparative
evaluations. The results with point prompts also exhibit a substantial
enhancement over SAM's capabilities, nearing or even surpassing existing
unprompted SOTA methodologies. Besides, SAM 2 demonstrates improved inference
speed and less performance degradation under various image corruptions.
Although slightly unsatisfactory results remain in specific edges or regions,
SAM 2's robust adaptability to 1-point prompts underscores its potential for
downstream surgical tasks with limited prompt requirements.
comment: Empirical study. Previous work "SAM Meets Robotic Surgery" is
accessible at: arXiv:2308.07156
☆ HiLo: A Learning Framework for Generalized Category Discovery Robust to Domain Shifts
Generalized Category Discovery (GCD) is a challenging task in which, given a
partially labelled dataset, models must categorize all unlabelled instances,
regardless of whether they come from labelled categories or from new ones. In
this paper, we challenge a remaining assumption in this task: that all images
share the same domain. Specifically, we introduce a new task and method to
handle GCD when the unlabelled data also contains images from domains other
than those of the labelled set. Our proposed `HiLo' networks extract High-level semantic
and Low-level domain features, before minimizing the mutual information between
the representations. Our intuition is that the clusterings based on domain
information and semantic information should be independent. We further extend
our method with a specialized domain augmentation tailored for the GCD task, as
well as a curriculum learning approach. Finally, we construct a benchmark from
corrupted fine-grained datasets as well as a large-scale evaluation on
DomainNet with real-world domain shifts, reimplementing a number of GCD
baselines in this setting. We demonstrate that HiLo outperforms SoTA category
discovery models by a large margin on all evaluations.
comment: 39 pages, 9 figures, 26 tables
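Minimizing mutual information between two feature sets is often approximated in practice by decorrelating them. The sketch below uses a cross-correlation penalty as such a proxy for separating semantic from domain features; it is an illustrative substitute, not the estimator used by HiLo:

```python
import numpy as np

def cross_correlation_penalty(high_feats, low_feats, eps=1e-8):
    """Decorrelation proxy for minimizing mutual information between
    semantic (high-level) and domain (low-level) features: the mean squared
    entry of the batch cross-correlation matrix (zero iff uncorrelated)."""
    h = (high_feats - high_feats.mean(0)) / (high_feats.std(0) + eps)
    l = (low_feats - low_feats.mean(0)) / (low_feats.std(0) + eps)
    c = h.T @ l / len(h)               # (dim_h, dim_l) cross-correlation
    return float((c ** 2).mean())
```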
☆ Sampling for View Synthesis: From Local Light Field Fusion to Neural Radiance Fields and Beyond
Capturing and rendering novel views of complex real-world scenes is a
long-standing problem in computer graphics and vision, with applications in
augmented and virtual reality, immersive experiences and 3D photography. The
advent of deep learning has enabled revolutionary advances in this area,
classically known as image-based rendering. However, previous approaches
require intractably dense view sampling or provide little or no guidance for
how users should sample views of a scene to reliably render high-quality novel
views. Local light field fusion proposes an algorithm for practical view
synthesis from an irregular grid of sampled views that first expands each
sampled view into a local light field via a multiplane image scene
representation, then renders novel views by blending adjacent local light
fields. Crucially, we extend traditional plenoptic sampling theory to derive a
bound that specifies precisely how densely users should sample views of a given
scene when using our algorithm. We achieve the perceptual quality of Nyquist
rate view sampling while using up to 4000x fewer views. Subsequent developments
have led to new scene representations for deep learning with view synthesis,
notably neural radiance fields, but the problem of sparse view synthesis from a
small number of images has only grown in importance. We reprise some of the
recent results on sparse and even single image view synthesis, while posing the
question of whether prescriptive sampling guidelines are feasible for the new
generation of image-based rendering algorithms.
comment: Article written for Frontiers of Science Award, International
Congress on Basic Science, 2024
☆ SAM2-Adapter: Evaluating & Adapting Segment Anything 2 in Downstream Tasks: Camouflage, Shadow, Medical Image Segmentation, and More
Tianrun Chen, Ankang Lu, Lanyun Zhu, Chaotao Ding, Chunan Yu, Deyi Ji, Zejian Li, Lingyun Sun, Papa Mao, Ying Zang
The advent of large models, also known as foundation models, has
significantly transformed the AI research landscape, with models like Segment
Anything (SAM) achieving notable success in diverse image segmentation
scenarios. Despite its advancements, SAM encountered limitations in handling
some complex low-level segmentation tasks like camouflaged object and medical
imaging. In response, in 2023, we introduced SAM-Adapter, which demonstrated
improved performance on these challenging tasks. Now, with the release of
Segment Anything 2 (SAM2), a successor with enhanced architecture and a larger
training corpus, we reassess these challenges. This paper introduces
SAM2-Adapter, the first adapter designed to overcome the persistent limitations
observed in SAM2 and achieve new state-of-the-art (SOTA) results in specific
downstream tasks including medical image segmentation, camouflaged (concealed)
object detection, and shadow detection. SAM2-Adapter builds on the
SAM-Adapter's strengths, offering enhanced generalizability and composability
for diverse applications. We present extensive experimental results
demonstrating SAM2-Adapter's effectiveness. We show the potential and encourage
the research community to leverage the SAM2 model with our SAM2-Adapter for
achieving superior segmentation outcomes. Code, pre-trained models, and data
processing protocols are available at
http://tianrun-chen.github.io/SAM-Adaptor/
comment: arXiv admin note: text overlap with arXiv:2304.09148
☆ Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches
3D Content Generation is at the heart of many computer graphics applications,
including video gaming, film-making, virtual and augmented reality, etc. This
paper proposes a novel deep-learning based approach for automatically
generating interactive and playable 3D game scenes, all from the user's casual
prompts such as a hand-drawn sketch. Sketch-based input offers a natural, and
convenient way to convey the user's design intention in the content creation
process. To circumvent the data-deficiency challenge in learning (i.e., the
lack of large-scale 3D scene training data), our method leverages a pre-trained 2D
denoising diffusion model to generate a 2D image of the scene as the conceptual
guidance. In this process, we adopt the isometric projection mode to factor out
unknown camera poses while obtaining the scene layout. From the generated
isometric image, we use a pre-trained image understanding method to segment the
image into meaningful parts, such as off-ground objects, trees, and buildings,
and extract the 2D scene layout. These segments and layouts are subsequently
fed into a procedural content generation (PCG) engine, such as a 3D video game
engine like Unity or Unreal, to create the 3D scene. The resulting 3D scene can
be seamlessly integrated into a game development environment and is readily
playable. Extensive tests demonstrate that our method can efficiently generate
high-quality and interactive 3D game scenes with layouts that closely follow
the user's intention.
comment: Project Page: https://xrvisionlabs.github.io/Sketch2Scene/
☆ Depth Any Canopy: Leveraging Depth Foundation Models for Canopy Height Estimation ECCV 2024
Estimating global tree canopy height is crucial for forest conservation and
climate change applications. However, capturing high-resolution ground truth
canopy height using LiDAR is expensive and not available globally. An efficient
alternative is to train a canopy height estimator to operate on single-view
remotely sensed imagery. The primary obstacle to this approach is that these
methods require significant training data to generalize well globally and
across uncommon edge cases. Recent monocular depth estimation foundation models
have shown strong zero-shot performance even for complex scenes. In this paper,
we leverage the representations learned by these models to transfer to the
remote sensing domain for measuring canopy height. Our findings suggest that
our proposed Depth Any Canopy, the result of fine-tuning the Depth Anything v2
model for canopy height estimation, provides a performant and efficient
solution, surpassing the current state-of-the-art with superior or comparable
performance using only a fraction of the computational resources and
parameters. Furthermore, our approach requires less than $1.30 in compute and
results in an estimated carbon footprint of 0.14 kgCO2. Code, experimental
results, and model checkpoints are openly available at
https://github.com/DarthReca/depth-any-canopy.
comment: Accepted at ECCV 2024 CV4E Workshop
☆ Saliency Detection in Educational Videos: Analyzing the Performance of Current Models, Identifying Limitations and Advancement Directions
Identifying the regions of a learning resource that a learner pays attention
to is crucial for assessing the material's impact and improving its design and
related support systems. Saliency detection in videos addresses the automatic
recognition of attention-drawing regions in single frames. In educational
settings, the recognition of pertinent regions in a video's visual stream can
enhance content accessibility and information retrieval tasks such as video
segmentation, navigation, and summarization. Such advancements can pave the way
for the development of advanced AI-assisted technologies that support learning
with greater efficacy. However, this task becomes particularly challenging for
educational videos due to the combination of unique characteristics such as
text, voice, illustrations, animations, and more. To the best of our knowledge,
there is currently no study that evaluates saliency detection approaches in
educational videos. In this paper, we address this gap by evaluating four
state-of-the-art saliency detection approaches for educational videos. We
reproduce the original studies and explore the replication capabilities for
general-purpose (non-educational) datasets. Then, we investigate the
generalization capabilities of the models and evaluate their performance on
educational videos. We conduct a comprehensive analysis to identify common
failure scenarios and possible areas of improvement. Our experimental results
show that educational videos remain a challenging context for generic video
saliency detection models.
☆ Towards Synergistic Deep Learning Models for Volumetric Cirrhotic Liver Segmentation in MRIs
Vandan Gorade, Onkar Susladkar, Gorkem Durak, Elif Keles, Ertugrul Aktas, Timurhan Cebeci, Alpay Medetalibeyoglu, Daniela Ladner, Debesh Jha, Ulas Bagci
Liver cirrhosis, a leading cause of global mortality, requires precise
segmentation of ROIs for effective disease monitoring and treatment planning.
Existing segmentation models often fail to capture complex feature interactions
and generalize across diverse datasets. To address these limitations, we
propose a novel synergistic theory that leverages complementary latent spaces
for enhanced feature interaction modeling. Our proposed architecture,
nnSynergyNet3D, integrates continuous and discrete latent spaces for 3D volumes
and features auto-configured training. This approach captures both fine-grained
and coarse features, enabling effective modeling of intricate feature
interactions. We empirically validated nnSynergyNet3D on a private dataset of
628 high-resolution T1 abdominal MRI scans from 339 patients. Our model
outperformed the baseline nnUNet3D by approximately 2%. Additionally, zero-shot
testing on healthy liver CT scans from the public LiTS dataset demonstrated
superior cross-modal generalization capabilities. These results highlight the
potential of synergistic latent space models to improve segmentation accuracy
and robustness, thereby enhancing clinical workflows by ensuring consistency
across CT and MRI modalities.
☆ SegXAL: Explainable Active Learning for Semantic Segmentation in Driving Scene Scenarios ICPR
Most of the sophisticated AI models utilize huge amounts of annotated data
and heavy training to achieve high-end performance. However, there are certain
challenges that hinder the deployment of AI models in "in-the-wild" scenarios,
i.e., inefficient use of unlabeled data, lack of incorporation of human
expertise, and lack of interpretation of the results. To mitigate these
challenges, we propose a novel Explainable Active Learning (XAL) model,
XAL-based semantic segmentation model "SegXAL", that can (i) effectively
utilize the unlabeled data, (ii) facilitate the "Human-in-the-loop" paradigm,
and (iii) augment the model decisions in an interpretable way. In particular,
we investigate the application of the SegXAL model for semantic segmentation in
driving scene scenarios. The SegXAL model proposes the image regions that
require labeling assistance from an oracle, using explainable AI (XAI) and
uncertainty measures in a weakly-supervised manner. Specifically, we propose a
novel Proximity-aware Explainable-AI (PAE) module and Entropy-based Uncertainty
(EBU) module to get an Explainable Error Mask, which enables the machine
teachers/human experts to provide intuitive reasoning behind the results and to
solicit feedback to the AI system via an active learning strategy. Such a
mechanism bridges the semantic gap between man and machine through
collaborative intelligence, where humans and AI actively enhance each other's
complementary strengths. A novel high-confidence sample selection technique
based on the DICE similarity coefficient is also presented within the SegXAL
framework. Extensive quantitative and qualitative analyses are carried out on
the benchmark Cityscapes dataset. Results show that our proposed SegXAL
outperforms other state-of-the-art models.
comment: 17 pages, 7 figures. To appear in the proceedings of the 27th
International Conference on Pattern Recognition (ICPR), 01-05 December, 2024,
Kolkata, India
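The DICE similarity coefficient underlying the high-confidence selection above is 2|A∩B| / (|A| + |B|) for binary masks. A minimal sketch, where the selection rule (comparing predictions against reference masks at a fixed threshold) is an illustrative assumption:

```python
import numpy as np

def dice(mask_a, mask_b, eps=1e-8):
    """DICE similarity coefficient between two binary masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * inter / (mask_a.sum() + mask_b.sum() + eps)

def select_high_confidence(pred_masks, ref_masks, thresh=0.9):
    """Keep indices whose predicted mask agrees with a reference mask above
    a DICE threshold -- a minimal stand-in for high-confidence selection."""
    return [i for i, (p, r) in enumerate(zip(pred_masks, ref_masks))
            if dice(p, r) >= thresh]
```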
☆ What could go wrong? Discovering and describing failure modes in computer vision
Deep learning models are effective, yet brittle. Even carefully trained,
their behavior tends to be hard to predict when confronted with
out-of-distribution samples. In this work, our goal is to propose a simple yet
effective solution to predict and describe via natural language potential
failure modes of computer vision models. Given a pretrained model and a set of
samples, our aim is to find sentences that accurately describe the visual
conditions in which the model underperforms. In order to study this important
topic and foster future research on it, we formalize the problem of
Language-Based Error Explainability (LBEE) and propose a set of metrics to
evaluate and compare different methods for this task. We propose solutions that
operate in a joint vision-and-language embedding space, and can characterize
through language descriptions model failures caused, e.g., by objects unseen
during training or adverse visual conditions. We experiment with different
tasks, such as classification under the presence of dataset bias and semantic
segmentation in unseen environments, and show that the proposed methodology
isolates nontrivial sentences associated with specific error causes. We hope
our work will help practitioners better understand the behavior of models,
increasing their overall safety and interpretability.
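One simple instantiation of matching language to failure clusters in a joint vision-and-language embedding space is to rank candidate sentences by cosine similarity to the centroid of hard-sample embeddings. This sketch assumes precomputed embeddings and is not the paper's exact method:

```python
import numpy as np

def describe_failures(error_embeddings, sentence_embeddings, sentences, top_k=1):
    """Return the sentences closest (cosine similarity) to the centroid of
    hard-sample embeddings, as candidate descriptions of the error cause."""
    centroid = error_embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    s = sentence_embeddings / np.linalg.norm(sentence_embeddings,
                                             axis=1, keepdims=True)
    scores = s @ centroid                       # cosine similarity per sentence
    order = np.argsort(-scores)[:top_k]
    return [sentences[i] for i in order]
```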
☆ Deep Learning for identifying systolic complexes in SCG traces: a cross-dataset analysis
The seismocardiographic signal is a promising alternative to the traditional
ECG in the analysis of the cardiac activity. In particular, the systolic
complex is known to be the most informative part of the seismocardiogram, thus
requiring further analysis. State-of-art solutions to detect the systolic
complex are based on Deep Learning models, which have been proven effective in
pioneering studies. However, these solutions have only been tested in a
controlled scenario considering only clean signals acquired from users
maintained still in supine position. On top of that, all these studies consider
data coming from a single dataset, ignoring the benefits and challenges related
to a cross-dataset scenario. In this work, a cross-dataset experimental
analysis was performed considering also data from a real-world scenario. Our
findings prove the effectiveness of a deep learning solution, while showing the
importance of a personalization step to contrast the domain shift, namely a
change in data distribution between training and testing data. Finally, we
demonstrate the benefits of a multi-channel approach, leveraging the
information extracted from both accelerometer and gyroscope data.
☆ A Review of 3D Reconstruction Techniques for Deformable Tissues in Robotic Surgery MICCAI 2024
As a crucial and intricate task in robotic minimally invasive surgery,
reconstructing surgical scenes using stereo or monocular endoscopic video holds
immense potential for clinical applications. NeRF-based techniques have
recently garnered attention for their ability to reconstruct scenes implicitly.
On the other hand, Gaussian splatting-based 3D-GS represents scenes explicitly
using 3D Gaussians and projects them onto a 2D plane as a replacement for the
complex volume rendering in NeRF. However, these methods face challenges
regarding surgical scene reconstruction, such as slow inference, dynamic
scenes, and surgical tool occlusion. This work explores and reviews
state-of-the-art (SOTA) approaches, discussing their innovations and
implementation principles. Furthermore, we replicate the models and conduct
testing and evaluation on two datasets. The test results demonstrate that with
advancements in these techniques, achieving real-time, high-quality
reconstructions becomes feasible.
comment: To appear in MICCAI 2024 EARTH Workshop. Code availability:
https://github.com/Epsilon404/surgicalnerf
☆ Clutter Classification Using Deep Learning in Multiple Stages
Path loss prediction for wireless communications is highly dependent on the
local environment. Propagation models including clutter information have been
shown to significantly increase model accuracy. This paper explores the
application of deep learning to satellite imagery to identify environmental
clutter types automatically. Recognizing these clutter types has numerous uses,
but our main application is to use clutter information to enhance propagation
prediction models. Knowing the type of obstruction (tree, building, and further
classifications) can improve the prediction accuracy of key propagation metrics
such as path loss.
comment: SoutheastCon 2024
☆ MultiViPerFrOG: A Globally Optimized Multi-Viewpoint Perception Framework for Camera Motion and Tissue Deformation
Reconstructing the 3D shape of a deformable environment from the information
captured by a moving depth camera is highly relevant to surgery. The underlying
challenge is the fact that simultaneously estimating camera motion and tissue
deformation in a fully deformable scene is an ill-posed problem, especially
from a single arbitrarily moving viewpoint. Current solutions are often
organ-specific and lack the robustness required to handle large deformations.
Here we propose a multi-viewpoint global optimization framework that can
flexibly integrate the output of low-level perception modules (data
association, depth, and relative scene flow) with kinematic and scene-modeling
priors to jointly estimate multiple camera motions and absolute scene flow. We
use simulated noisy data to show three practical examples that successfully
constrain the convergence to a unique solution. Overall, our method shows
robustness to combined noisy input measures and can process hundreds of points
in a few milliseconds. MultiViPerFrOG builds a generalized learning-free
scaffolding for spatio-temporal encoding that can unlock advanced surgical
scene representations and will facilitate the development of the
computer-assisted-surgery technologies of the future.
☆ Detecting Car Speed using Object Detection and Depth Estimation: A Deep Learning Framework
Road accidents are common in almost every part of the world, and, in the
majority of cases, fatal accidents are attributed to vehicles overspeeding.
Overspeeding is usually controlled with checkpoints at various parts of the
road, but not all traffic police are equipped with existing speed-estimating
devices such as LiDAR- or radar-based guns. This project addresses vehicle
speed estimation using handheld devices, such as mobile phones or wearable
cameras with a network connection, by estimating speed with deep learning
frameworks.
comment: This is the pre-print of the paper which was accepted for oral
presentation and publication in the proceedings of IEEE CONIT 2024, organized
at Pune from June 21 to 23, 2024. The paper is 6 pages long and it contains
11 figures and 1 table. This is not the final version of the paper
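The depth-based idea behind such a framework can be illustrated with a minimal sketch: given per-frame depth estimates of a detected vehicle (e.g., from a monocular depth network applied to the detection box), the closing speed follows from finite differences over time. The function and numbers below are illustrative assumptions, not the paper's implementation.

```python
def speed_from_depths(depths_m, fps):
    """Estimate a vehicle's closing speed (m/s) from per-frame depth
    estimates of its detection box. Positive = approaching the camera.
    A finite-difference sketch; a real pipeline would smooth the
    noisy per-frame depth estimates first."""
    if len(depths_m) < 2:
        raise ValueError("need at least two frames")
    dt = 1.0 / fps
    # Average frame-to-frame depth change over the window.
    deltas = [depths_m[i] - depths_m[i + 1] for i in range(len(depths_m) - 1)]
    return sum(deltas) / len(deltas) / dt

# A car whose estimated depth shrinks by 1 m per frame at 10 fps
# is closing at roughly 10 m/s (36 km/h).
v = speed_from_depths([30.0, 29.0, 28.0, 27.0], fps=10)
print(round(v, 1))  # → 10.0
```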
☆ AggSS: An Aggregated Self-Supervised Approach for Class-Incremental Learning BMVC 2024
This paper investigates the impact of self-supervised learning, specifically
image rotations, on various class-incremental learning paradigms. Here, each
image with a predefined rotation is considered as a new class for training. At
inference, all image rotation predictions are aggregated for the final
prediction, a strategy we term Aggregated Self-Supervision (AggSS). We observe
a shift in the deep neural network's attention towards intrinsic object
features as it learns through the AggSS strategy. This learning approach
significantly enhances class-incremental learning by promoting robust feature
learning. AggSS serves as a plug-and-play module that can be seamlessly
incorporated into any class-incremental learning framework, leveraging its
powerful feature learning capabilities to enhance performance across various
class-incremental learning approaches. Extensive experiments conducted on
standard incremental learning datasets CIFAR-100 and ImageNet-Subset
demonstrate the significant role of AggSS in improving performance within these
paradigms.
comment: Accepted in BMVC 2024
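The AggSS inference strategy described above can be sketched as follows: each rotated copy of the image is classified over rotation-expanded classes, the logits belonging to that rotation are sliced out, and the slices are averaged for the final prediction. The toy stand-in network and class layout are assumptions for illustration, not the paper's model.

```python
import numpy as np

N_ROT = 4  # 0°, 90°, 180°, 270°; each rotation of a class is its own label

def aggss_predict(logits_fn, image, n_classes):
    """Aggregated Self-Supervision at inference: feed each rotated copy
    to a classifier over n_classes * N_ROT rotation-expanded classes,
    slice out the logits for that rotation, and average them.
    `logits_fn` stands in for the trained network (hypothetical)."""
    agg = np.zeros(n_classes)
    for r in range(N_ROT):
        rotated = np.rot90(image, k=r)
        logits = logits_fn(rotated)              # shape (n_classes * N_ROT,)
        agg += logits[r * n_classes:(r + 1) * n_classes]
    return agg / N_ROT

# Toy stand-in network: always favors class 2 (of 3) in every rotation block.
def toy_net(img):
    out = np.zeros(12)
    out[2::3] = 1.0  # class-2 slot of each of the 4 rotation blocks
    return out

probs = aggss_predict(toy_net, np.zeros((8, 8)), n_classes=3)
print(int(np.argmax(probs)))  # → 2
```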
☆ Enhancing Journalism with AI: A Study of Contextualized Image Captioning for News Articles using LLMs and LMMs
Large language models (LLMs) and large multimodal models (LMMs) have
significantly impacted the AI community, industry, and various economic
sectors. In journalism, integrating AI poses unique challenges and
opportunities, particularly in enhancing the quality and efficiency of news
reporting. This study explores how LLMs and LMMs can assist journalistic
practice by generating contextualised captions for images accompanying news
articles. We conducted experiments using the GoodNews dataset to evaluate the
ability of LMMs (BLIP-2, GPT-4v, or LLaVA) to incorporate one of two types of
context: entire news articles, or extracted named entities. In addition, we
compared their performance to a two-stage pipeline composed of a captioning
model (BLIP-2, OFA, or ViT-GPT2) with post-hoc contextualisation with LLMs
(GPT-4 or LLaMA). We assess a diversity of models, and we find that while the
choice of contextualisation model is a significant factor for the two-stage
pipelines, this is not the case in the LMMs, where smaller, open-source models
perform well compared to proprietary, GPT-powered ones. Additionally, we found
that controlling the amount of provided context enhances performance. These
results highlight the limitations of a fully automated approach and underscore
the necessity for an interactive, human-in-the-loop strategy.
☆ Multi-Scale and Detail-Enhanced Segment Anything Model for Salient Object Detection ACM MM2024
Salient Object Detection (SOD) aims to identify and segment the most
prominent objects in images. Advanced SOD methods often utilize various
Convolutional Neural Networks (CNN) or Transformers for deep feature
extraction. However, these methods still deliver low performance and poor
generalization in complex cases. Recently, the Segment Anything Model (SAM) has
been proposed as a visual foundation model, which offers strong segmentation
and generalization capabilities. Nonetheless, SAM requires accurate prompts of
target objects, which are unavailable in SOD. Additionally, SAM lacks the
utilization of multi-scale and multi-level information, as well as the
incorporation of fine-grained details. To address these shortcomings, we
propose a Multi-scale and Detail-enhanced SAM (MDSAM) for SOD. Specifically, we
first introduce a Lightweight Multi-Scale Adapter (LMSA), which allows SAM to
learn multi-scale information with very few trainable parameters. Then, we
propose a Multi-Level Fusion Module (MLFM) to comprehensively utilize the
multi-level information from SAM's encoder. Finally, we propose a Detail
Enhancement Module (DEM) to incorporate fine-grained details into SAM.
Experimental results demonstrate the superior performance of our model on
multiple SOD datasets and its strong generalization on other segmentation
tasks. The source code is released at https://github.com/BellyBeauty/MDSAM.
comment: This work is accepted by ACM MM2024
☆ Deep Transfer Learning for Kidney Cancer Diagnosis
Yassine Habchi, Hamza Kheddar, Yassine Himeur, Abdelkrim Boukabou, Shadi Atalla, Wathiq Mansoor, Hussain Al-Ahmad
Many incurable diseases prevalent across global societies stem from various
influences, including lifestyle choices, economic conditions, social factors,
and genetics. Research predominantly focuses on these diseases due to their
widespread nature, aiming to decrease mortality, enhance treatment options, and
improve healthcare standards. Among these, kidney disease stands out as a
particularly severe condition affecting men and women worldwide. Nonetheless,
there is a pressing need for continued research into innovative, early
diagnostic methods to develop more effective treatments for such diseases.
Recently, automatic diagnosis of kidney cancer has become an important
challenge, especially when using deep learning (DL), because of its reliance on
medical training datasets, which in most cases are difficult and expensive to
obtain. Furthermore, algorithms usually require data from the same domain and
powerful computers with ample storage capacity. To overcome these issues, a
paradigm known as transfer learning (TL) has been proposed, which can produce
impressive results by building on models pre-trained on different data. This
paper presents, to the best of the authors' knowledge,
the first comprehensive survey of DL-based TL frameworks for kidney cancer
diagnosis. This is a strong contribution to help researchers understand the
current challenges and perspectives of this topic. Hence, the main limitations
and advantages of each framework are identified and detailed critical analyses
are provided. Looking ahead, the article identifies promising directions for
future research, and the discussion concludes by reflecting on the pivotal role
of TL in the development of precision medicine and its effects on clinical
practice and research in oncology.
comment: 32 pages, 8 figures and 8 tables
☆ An Explainable Non-local Network for COVID-19 Diagnosis
CNNs have achieved excellent results in the automatic classification of
medical images. In this study, we propose a novel deep residual 3D attention
non-local network (NL-RAN) to classify CT images as COVID-19, common pneumonia,
or normal, enabling rapid and explainable COVID-19 diagnosis. We
built a deep residual 3D attention non-local network that could achieve
end-to-end training. The network is embedded with a nonlocal module to capture
global information, while a 3D attention module is embedded to focus on the
details of the lesion so that it can directly analyze the 3D lung CT and output
the classification results. The output of the attention module can be used as a
heat map to increase the interpretability of the model. 4079 3D CT scans were
included in this study. Each scan had a unique label (novel coronavirus
pneumonia, common pneumonia, and normal). The cohort of CT scans was randomly
split into a training set of 3263 scans, a validation set of 408 scans, and a
testing set of 408 scans. We compared NL-RAN with existing mainstream
classification methods, such as CovNet, CBAM, and ResNet, and compared its
visualization results with methods such as CAM. Model performance was evaluated
using the Area Under the ROC Curve (AUC), precision, and F1-score. NL-RAN
achieved an AUC of 0.9903, a precision of 0.9473, and an F1-score of 0.9462,
surpassing all compared classification methods. The heat
map output by the attention module is also clearer than the heat map output by
CAM. Our experimental results indicate that our proposed method performs
significantly better than existing methods. In addition, the first attention
module outputs a heat map containing detailed outline information to increase
the interpretability of the model. Our experiments indicate that the inference
of our model is fast. It can provide real-time assistance with diagnosis.
☆ Respiratory Subtraction for Pulmonary Microwave Ablation Evaluation
Wan Li, Xinyun Zhong, Wei Li, Song Zhang, Moheng Rong, Yan Xi, Peng Yuan, Zechen Wang, Xiaolei Jiang, Rongxi Yi, Hui Tang, Yang Chen, Chaohui Tong, Zhan Wu, Feng Wang
Currently, lung cancer is a leading cause of global cancer mortality, often
necessitating minimally invasive interventions. Microwave ablation (MWA) is
extensively utilized for both primary and secondary lung tumors. Although
numerous clinical guidelines and standards for MWA have been established, the
clinical evaluation of ablation surgery remains challenging and requires
long-term patient follow-up for confirmation. In this paper, we propose a
method termed respiratory subtraction to evaluate lung tumor ablation therapy
performance based on pre- and post-operative image guidance. Initially,
preoperative images undergo coarse rigid registration to their corresponding
postoperative positions, followed by further non-rigid registration.
Subsequently, subtraction images are generated by subtracting the registered
preoperative images from the postoperative ones. Furthermore, to enhance the
clinical assessment of MWA treatment performance, we devise a quantitative
analysis metric to evaluate ablation efficacy by comparing differences between
tumor areas and treatment areas. To the best of our knowledge, this is the
first work in the field to facilitate the assessment of MWA surgery
performance on pulmonary tumors. Extensive experiments involving 35 clinical
cases further validate the efficacy of the respiratory subtraction method. The
experimental results confirm the effectiveness of the respiratory subtraction
method and the proposed quantitative evaluation metric in assessing lung tumor
treatment.
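A quantitative metric of the kind described, comparing tumor areas with treatment areas, might for instance measure how much of the tumor mask falls inside the ablated region identified in the subtraction image. The function below is an illustrative stand-in, not the paper's exact metric.

```python
import numpy as np

def ablation_coverage(tumor_mask, treatment_mask):
    """Quantify ablation efficacy as the fraction of tumor pixels
    covered by the treatment region (a simple stand-in metric;
    the paper's exact formulation is not reproduced here)."""
    tumor = tumor_mask.astype(bool)
    treated = treatment_mask.astype(bool)
    if tumor.sum() == 0:
        raise ValueError("empty tumor mask")
    return (tumor & treated).sum() / tumor.sum()

# A 2x2 tumor fully inside a 3x3 treatment zone → full coverage.
tumor = np.zeros((8, 8), dtype=np.uint8); tumor[3:5, 3:5] = 1
zone = np.zeros((8, 8), dtype=np.uint8);  zone[2:5, 2:5] = 1
print(ablation_coverage(tumor, zone))  # → 1.0
```

A coverage below 1.0 would flag tumor regions left outside the ablated zone, which is exactly what post-operative evaluation needs to surface.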
☆ Dual-branch PolSAR Image Classification Based on GraphMAE and Local Feature Extraction
The annotation of polarimetric synthetic aperture radar (PolSAR) images is a
labor-intensive and time-consuming process. Therefore, classifying PolSAR
images with limited labels is a challenging task in the remote sensing domain.
In recent years, self-supervised learning approaches have proven effective in
PolSAR image classification with sparse labels. However, we observe a lack of
research on generative self-supervised learning for this task. Motivated
by this, we propose a dual-branch classification model based on generative
self-supervised learning in this paper. The first branch is a
superpixel-branch, which learns superpixel-level polarimetric representations
using a generative self-supervised graph masked autoencoder. To acquire finer
classification results, a convolutional neural networks-based pixel-branch is
further incorporated to learn pixel-level features. Classification with fused
dual-branch features is finally performed to obtain the predictions.
Experimental results on the benchmark Flevoland dataset demonstrate that our
approach yields promising classification results.
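The dual-branch fusion step can be sketched as follows: each superpixel's learned representation is broadcast to its member pixels and concatenated with the pixel-branch feature map before classification. This is an illustrative fusion under assumed shapes; the paper's exact scheme may differ.

```python
import numpy as np

def fuse_dual_branch(pixel_feats, sp_feats, sp_labels):
    """Fuse a pixel-branch feature map with superpixel-branch features
    by broadcasting each superpixel's vector to its member pixels and
    concatenating along the channel axis."""
    sp_broadcast = sp_feats[sp_labels]          # (h, w, c_sp) via fancy indexing
    return np.concatenate([pixel_feats, sp_broadcast], axis=-1)

rng = np.random.default_rng(0)
pixel = rng.random((4, 4, 8))                   # pixel-level features
sp = rng.random((3, 5))                         # 3 superpixels, 5-dim features
labels = rng.integers(0, 3, size=(4, 4))        # superpixel id per pixel
fused = fuse_dual_branch(pixel, sp, labels)
print(fused.shape)  # → (4, 4, 13)
```

The fused (pixel + superpixel) features then feed a final classifier to obtain per-pixel predictions.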
☆ Efficient and Accurate Pneumonia Detection Using a Novel Multi-Scale Transformer Approach
Pneumonia, a severe respiratory disease, poses significant diagnostic
challenges, especially in underdeveloped regions. Traditional diagnostic
methods, such as chest X-rays, suffer from variability in interpretation among
radiologists, necessitating reliable automated tools. In this study, we propose
a novel approach combining deep learning and transformer-based attention
mechanisms to enhance pneumonia detection from chest X-rays. Our method begins
with lung segmentation using a TransUNet model that integrates our specialized
transformer module, which has fewer parameters compared to common transformers
while maintaining performance. This model is trained on the "Chest Xray Masks
and Labels" dataset and then applied to the Kermany and Cohen datasets to
isolate lung regions, enhancing subsequent classification tasks. For
classification, we employ pre-trained ResNet models (ResNet-50 and ResNet-101)
to extract multi-scale feature maps, processed through our modified transformer
module. By employing our specialized transformer, we attain superior results
with significantly fewer parameters compared to common transformer models. Our
approach achieves high accuracy rates of 92.79% on the Kermany dataset and
95.11% on the Cohen dataset, ensuring robust and efficient performance suitable
for resource-constrained environments.
Code: https://github.com/amirrezafateh/Multi-Scale-Transformer-Pneumonia
☆ SG-JND: Semantic-Guided Just Noticeable Distortion Predictor For Image Compression ICIP 2024
Linhan Cao, Wei Sun, Xiongkuo Min, Jun Jia, Zicheng Zhang, Zijian Chen, Yucheng Zhu, Lizhou Liu, Qiubo Chen, Jing Chen, Guangtao Zhai
Just noticeable distortion (JND), representing the threshold of distortion in
an image that is minimally perceptible to the human visual system (HVS), is
crucial for image compression algorithms to achieve a trade-off between
transmission bit rate and image quality. However, traditional JND prediction
methods only rely on pixel-level or sub-band level features, lacking the
ability to capture the impact of image content on JND. To bridge this gap, we
propose a Semantic-Guided JND (SG-JND) network to leverage semantic information
for JND prediction. In particular, SG-JND consists of three essential modules:
the image preprocessing module extracts semantic-level patches from images, the
feature extraction module extracts multi-layer features by utilizing the
cross-scale attention layers, and the JND prediction module regresses the
extracted features into the final JND value. Experimental results show that
SG-JND achieves state-of-the-art performance on two publicly available JND
datasets, which demonstrates the effectiveness of SG-JND and highlights the
significance of incorporating semantic information in JND assessment.
comment: Accepted by ICIP 2024
☆ Evaluating Modern Approaches in 3D Scene Reconstruction: NeRF vs Gaussian-Based Methods
Yiming Zhou, Zixuan Zeng, Andi Chen, Xiaofan Zhou, Haowei Ni, Shiyao Zhang, Panfeng Li, Liangxi Liu, Mengyao Zheng, Xupeng Chen
Exploring the capabilities of Neural Radiance Fields (NeRF) and
Gaussian-based methods in the context of 3D scene reconstruction, this study
contrasts these modern approaches with traditional Simultaneous Localization
and Mapping (SLAM) systems. Utilizing datasets such as Replica and ScanNet, we
assess performance based on tracking accuracy, mapping fidelity, and view
synthesis. Findings reveal that NeRF excels in view synthesis, offering unique
capabilities in generating new perspectives from existing data, albeit at
slower processing speeds. Conversely, Gaussian-based methods provide rapid
processing and significant expressiveness but lack comprehensive scene
completion. Enhanced by global optimization and loop closure techniques, newer
methods like NICE-SLAM and SplaTAM not only surpass older frameworks such as
ORB-SLAM2 in terms of robustness but also demonstrate superior performance in
dynamic and complex environments. This comparative analysis bridges theoretical
research with practical implications, shedding light on future developments in
robust 3D scene reconstruction across various real-world applications.
comment: Accepted by 2024 6th International Conference on Data-driven
Optimization of Complex Systems
☆ CoBooM: Codebook Guided Bootstrapping for Medical Image Representation Learning MICCAI 2024
Self-supervised learning (SSL) has emerged as a promising paradigm for
medical image analysis by harnessing unannotated data. Despite their potential,
the existing SSL approaches overlook the high anatomical similarity inherent in
medical images. This makes it challenging for SSL methods to capture diverse
semantic content in medical images consistently. This work introduces a novel
and generalized solution that implicitly exploits anatomical similarities by
integrating codebooks in SSL. The codebook serves as a concise and informative
dictionary of visual patterns, which not only aids in capturing nuanced
anatomical details but also facilitates the creation of robust and generalized
feature representations. In this context, we propose CoBooM, a novel framework
for self-supervised medical image learning by integrating continuous and
discrete representations. The continuous component ensures the preservation of
fine-grained details, while the discrete aspect facilitates coarse-grained
feature extraction through the structured embedding space. To understand the
effectiveness of CoBooM, we conduct a comprehensive evaluation of various
medical datasets encompassing chest X-rays and fundus images. The experimental
results reveal a significant performance gain in classification and
segmentation tasks.
comment: Accepted in MICCAI 2024
☆ Unveiling Hidden Visual Information: A Reconstruction Attack Against Adversarial Visual Information Hiding
This paper investigates the security vulnerabilities of
adversarial-example-based image encryption by executing data reconstruction
(DR) attacks on encrypted images. A representative image encryption method is
the adversarial visual information hiding (AVIH), which uses type-I adversarial
example training to protect gallery datasets used in image recognition tasks.
In the AVIH method, the type-I adversarial example approach creates images that
appear completely different but are still recognized by machines as the
original ones. Additionally, the AVIH method can restore encrypted images to
their original forms using a predefined private key generative model. For the
best security, assigning a unique key to each image is recommended; however,
storage limitations may necessitate some images sharing the same key model.
This raises a crucial security question for AVIH: How many images can safely
share the same key model without being compromised by a DR attack? To address
this question, we introduce a dual-strategy DR attack against the AVIH
encryption method by incorporating (1) generative-adversarial loss and (2)
augmented identity loss, which prevent DR from overfitting -- an issue akin to
that in machine learning. Our numerical results validate this approach through
image recognition and re-identification benchmarks, demonstrating that our
strategy can significantly enhance the quality of reconstructed images, thereby
requiring fewer key-sharing encrypted images. Our source code to reproduce our
results will be available soon.
comment: 12 pages
☆ UHNet: An Ultra-Lightweight and High-Speed Edge Detection Network
Edge detection is crucial in medical image processing, enabling precise
extraction of structural information to support lesion identification and image
analysis. Traditional edge detection models typically rely on complex
Convolutional Neural Networks and Vision Transformer architectures. Due to
their numerous parameters and high computational demands, these models are
limited in their application on resource-constrained devices. This paper
presents an ultra-lightweight edge detection model (UHNet), characterized by
its minimal parameter count, rapid computation speed, negligible pre-training
costs, and commendable performance. UHNet boasts impressive
performance metrics with 42.3k parameters, 166 FPS, and 0.79G FLOPs. By
employing an innovative feature extraction module and optimized residual
connection method, UHNet significantly reduces model complexity and
computational requirements. Additionally, a lightweight feature fusion strategy
is explored, enhancing detection accuracy. Experimental results on the BSDS500,
NYUD, and BIPED datasets validate that UHNet achieves remarkable edge detection
performance while maintaining high efficiency. This work not only provides new
insights into the design of lightweight edge detection models but also
demonstrates the potential and application prospects of the UHNet model in
engineering applications such as medical image processing. The codes are
available at https://github.com/stoneLi20cv/UHNet
☆ InstantStyleGaussian: Efficient Art Style Transfer with 3D Gaussian Splatting
We present InstantStyleGaussian, an innovative 3D style transfer method based
on the 3D Gaussian Splatting (3DGS) scene representation. By inputting a target
style image, it quickly generates new 3D GS scenes. Our approach operates on
pre-reconstructed GS scenes, combining diffusion models with an improved
iterative dataset update strategy. It utilizes diffusion models to generate
target style images, adds these new images to the training dataset, and uses
this dataset to iteratively update and optimize the GS scenes. Extensive
experimental results demonstrate that our method ensures high-quality stylized
scenes while offering significant advantages in style transfer speed and
consistency.
☆ MU-MAE: Multimodal Masked Autoencoders-Based One-Shot Learning
With the exponential growth of multimedia data, leveraging multimodal sensors
presents a promising approach for improving accuracy in human activity
recognition. Nevertheless, accurately identifying these activities using both
video data and wearable sensor data presents challenges due to the
labor-intensive data annotation and the reliance on external pretrained models or
additional data. To address these challenges, we introduce Multimodal Masked
Autoencoders-Based One-Shot Learning (Mu-MAE). Mu-MAE integrates a multimodal
masked autoencoder with a synchronized masking strategy tailored for wearable
sensors. This masking strategy compels the networks to capture more meaningful
spatiotemporal features, which enables effective self-supervised pretraining
without the need for external data. Furthermore, Mu-MAE leverages the
representation extracted from multimodal masked autoencoders as prior
information input to a cross-attention multimodal fusion layer. This fusion
layer emphasizes spatiotemporal features requiring attention across different
modalities while highlighting differences from other classes, aiding in the
classification of various classes in metric-based one-shot learning.
Comprehensive evaluations on MMAct one-shot classification show that Mu-MAE
outperforms all the evaluated approaches, achieving up to an 80.17% accuracy
for five-way one-shot multimodal classification, without the use of additional
data.
comment: IEEE MIPR 2024
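The synchronized masking strategy can be sketched as drawing one random set of masked token positions and reusing it across all modality streams, so that temporally aligned patches are hidden together. The function signature below is assumed for illustration, not taken from the paper.

```python
import numpy as np

def synchronized_mask(n_tokens, mask_ratio, n_modalities, seed=0):
    """Draw ONE random set of masked token positions and reuse it for
    every modality, so temporally aligned patches are hidden together
    (a sketch of synchronized masking; implementation details are assumed)."""
    rng = np.random.default_rng(seed)
    n_mask = int(n_tokens * mask_ratio)
    masked = rng.choice(n_tokens, size=n_mask, replace=False)
    mask = np.zeros(n_tokens, dtype=bool)
    mask[masked] = True
    # Same boolean mask repeated for each modality stream.
    return np.tile(mask, (n_modalities, 1))

masks = synchronized_mask(n_tokens=10, mask_ratio=0.6, n_modalities=3)
print(masks.shape)  # → (3, 10); every modality hides the same 6 positions
```

Because the hidden positions coincide, the reconstruction task cannot be solved by copying an unmasked modality at the same time step, which forces cross-modal spatiotemporal reasoning.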
☆ LLDif: Diffusion Models for Low-light Emotion Recognition ICPR2024
This paper introduces LLDif, a novel diffusion-based facial expression
recognition (FER) framework tailored for extremely low-light (LL) environments.
Images captured under such conditions often suffer from low brightness and
significantly reduced contrast, presenting challenges to conventional methods.
These challenges include poor image quality that can significantly reduce the
accuracy of emotion recognition. LLDif addresses these issues with a novel
two-stage training process that combines a Label-aware CLIP (LA-CLIP), an
embedding prior network (PNET), and a transformer-based network adept at
handling the noise of low-light images. The first stage involves LA-CLIP
generating a joint embedding prior distribution (EPD) to guide the LLformer in
label recovery. In the second stage, the diffusion model (DM) refines the EPD
inference, utilising the compactness of EPD for precise predictions.
Experimental evaluations on various LL-FER datasets have shown that LLDif
achieves competitive performance, underscoring its potential to enhance FER
applications in challenging lighting conditions.
comment: Accepted by ICPR2024
☆ Physical prior guided cooperative learning framework for joint turbulence degradation estimation and infrared video restoration
Infrared imaging and turbulence strength measurements are in widespread
demand in many fields. This paper introduces a Physical Prior Guided
Cooperative Learning (P2GCL) framework to jointly enhance atmospheric
turbulence strength estimation and infrared image restoration. P2GCL involves a
cyclic collaboration between two models: a TMNet, which measures turbulence
strength and outputs the refractive index structure constant (Cn2) as a
physical prior, and a TRNet, which restores infrared image sequences based on
Cn2 and feeds the restored images back to TMNet to boost measurement
accuracy. A novel Cn2-guided frequency loss function and a physical constraint
loss are introduced to align the training process with physical theories.
Experiments demonstrate P2GCL achieves the best performance for both turbulence
strength estimation (improving Cn2 MAE by 0.0156, enhancing R2 by 0.1065) and
image restoration (enhancing PSNR by 0.2775 dB), validating the significant
impact of physical prior guided cooperative learning.
comment: 21
☆ Cross-View Meets Diffusion: Aerial Image Synthesis with Geometry and Text Guidance
Aerial imagery analysis is critical for many research fields. However,
obtaining frequent high-quality aerial images is not always accessible due to
its high effort and cost requirements. One solution is to use the
Ground-to-Aerial (G2A) technique to synthesize aerial images from easily
collectible ground images. However, G2A is rarely studied because of its
challenges, including, but not limited to, drastic view changes, occlusion,
and a limited range of visibility. In this paper, we present a novel Geometric Preserving
Ground-to-Aerial (G2A) image synthesis (GPG2A) model that can generate
realistic aerial images from ground images. GPG2A consists of two stages. The
first stage predicts the Bird's Eye View (BEV) segmentation (referred to as the
BEV layout map) from the ground image. The second stage synthesizes the aerial
image from the predicted BEV layout map and text descriptions of the ground
image. To train our model, we present a new multi-modal cross-view dataset,
namely VIGORv2 which is built upon VIGOR with newly collected aerial images,
maps, and text descriptions. Our extensive experiments illustrate that GPG2A
synthesizes better geometry-preserved aerial images than existing models. We
also present two applications, data augmentation for cross-view
geo-localization and sketch-based region search, to further verify the
effectiveness of our GPG2A. The code and data will be publicly available.
☆ VideoQA in the Era of LLMs: An Empirical Study
Junbin Xiao, Nanxin Huang, Hangyu Qin, Dongyang Li, Yicong Li, Fengbin Zhu, Zhulin Tao, Jianxing Yu, Liang Lin, Tat-Seng Chua, Angela Yao
Video Large Language Models (Video-LLMs) are flourishing and have advanced
many video-language tasks. As a golden testbed, Video Question Answering
(VideoQA) plays a pivotal role in Video-LLM development. This work conducts a
timely and comprehensive study of Video-LLMs' behavior in VideoQA, aiming to
elucidate their success and failure modes, and provide insights towards more
human-like video understanding and question answering. Our analyses demonstrate
that Video-LLMs excel in VideoQA; they can correlate contextual cues and
generate plausible responses to questions about varied video contents. However,
models falter in handling video temporality, both in reasoning about temporal
content ordering and grounding QA-relevant temporal moments. Moreover, the
models behave unintuitively - they are unresponsive to adversarial video
perturbations while being sensitive to simple variations of candidate answers
and questions. Also, they do not necessarily generalize better. The findings
demonstrate Video-LLMs' QA capability under standard conditions yet highlight
their severe deficiency in robustness and interpretability, suggesting the
urgent need for rationales in Video-LLM development.
comment: Preprint. Under Review
☆ Connective Viewpoints of Signal-to-Noise Diffusion Models
Khanh Doan, Long Tung Vuong, Tuan Nguyen, Anh Tuan Bui, Quyen Tran, Thanh-Toan Do, Dinh Phung, Trung Le
Diffusion models (DM) have become fundamental components of generative
models, excelling across various domains such as image creation, audio
generation, and complex data interpolation. Signal-to-Noise diffusion models
constitute a diverse family covering most state-of-the-art diffusion models.
While there have been several attempts to study Signal-to-Noise (S2N) diffusion
models from various perspectives, there remains a need for a comprehensive
study connecting different viewpoints and exploring new perspectives. In this
study, we offer a comprehensive perspective on noise schedulers, examining
their role through the lens of the signal-to-noise ratio (SNR) and its
connections to information theory. Building upon this framework, we have
developed a generalized backward equation to enhance the performance of the
inference process.
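In standard notation (not specific to this paper's derivation), the signal-to-noise ratio that S2N noise schedulers trade off is defined from the forward perturbation:

```latex
x_t = \alpha_t\, x_0 + \sigma_t\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),
\qquad
\mathrm{SNR}(t) = \frac{\alpha_t^2}{\sigma_t^2},
```

where a noise scheduler is a choice of $(\alpha_t, \sigma_t)$ under which $\mathrm{SNR}(t)$ decreases monotonically in $t$.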
☆ Is SAM 2 Better than SAM in Medical Image Segmentation?
Segment Anything Model (SAM) demonstrated impressive performance in zero-shot
promptable segmentation on natural images. The recently released Segment
Anything Model 2 (SAM 2) model claims to have better performance than SAM on
images while extending the model's capabilities to video segmentation. It is
important to evaluate the recent model's ability in medical image segmentation
in a zero-shot promptable manner. In this work, we performed extensive studies
with multiple datasets from different imaging modalities to compare the
performance between SAM and SAM 2. We used two point prompt strategies: (i)
single positive prompt near the centroid of the target structure and (ii)
additional positive prompts placed randomly within the target structure. The
evaluation covered 21 unique organ-modality combinations, including abdominal
structures, cardiac structures, and fetal head images acquired from publicly
available MRI, CT, and Ultrasound datasets. The preliminary results, based on
2D images, indicate that while SAM 2 may perform slightly better in a few
cases, it does not in general surpass SAM for medical image segmentation.
Especially when the contrast is low, as in CT and Ultrasound images, SAM 2
performs worse than SAM. For MRI images, SAM 2 performs on par with or better
than SAM. Like SAM, SAM 2 also suffers from over-segmentation, especially when
the boundaries of the organ to be segmented are fuzzy.
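The two point-prompt strategies can be sketched with NumPy, assuming a binary ground-truth mask of the target structure is available (hypothetical helpers, not the study's code):

```python
import numpy as np

def centroid_prompt(mask):
    """Positive point prompt near the centroid of a binary target mask.

    If the exact centroid falls outside a non-convex mask, snap to the
    nearest foreground pixel so the prompt stays inside the structure.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    i = np.argmin((ys - cy) ** 2 + (xs - cx) ** 2)
    return int(ys[i]), int(xs[i])

def random_extra_prompts(mask, k, seed=0):
    """k additional positive prompts sampled uniformly inside the mask."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)
    idx = rng.choice(ys.size, size=k, replace=False)
    return list(zip(ys[idx].tolist(), xs[idx].tolist()))
```

Both strategies feed SAM-style models the resulting (row, col) coordinates as positive clicks.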
☆ Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation
We introduce a novel graph-based Retrieval-Augmented Generation (RAG)
framework specifically designed for the medical domain, called
\textbf{MedGraphRAG}, aimed at enhancing Large Language Model (LLM)
capabilities and generating evidence-based results, thereby improving safety
and reliability when handling private medical data. Our comprehensive pipeline
begins with a hybrid static-semantic approach to document chunking,
significantly improving context capture over traditional methods. Extracted
entities are used to create a three-tier hierarchical graph structure, linking
entities to foundational medical knowledge sourced from medical papers and
dictionaries. These entities are then interconnected to form meta-graphs, which
are merged based on semantic similarities to develop a comprehensive global
graph. This structure supports precise information retrieval and response
generation. The retrieval process employs a U-retrieve method to balance global
awareness and indexing efficiency of the LLM. Our approach is validated through
a comprehensive ablation study comparing various methods for document chunking,
graph construction, and information retrieval. The results not only demonstrate
that our hierarchical graph construction method consistently outperforms
state-of-the-art models on multiple medical Q\&A benchmarks, but also confirms
that the responses generated include source documentation, significantly
enhancing the reliability of medical LLMs in practical applications. Code will
be at: https://github.com/MedicineToken/Medical-Graph-RAG/tree/main
☆ pyBregMan: A Python library for Bregman Manifolds
A Bregman manifold is a synonym for a dually flat space in information
geometry, which admits a Bregman divergence as its canonical divergence. Bregman
manifolds are induced by smooth strictly convex functions like the cumulant or
partition functions of regular exponential families, the negative entropy of
mixture families, or the characteristic functions of regular cones just to list
a few such convex Bregman generators. We describe the design of pyBregMan, a
library which implements generic operations on Bregman manifolds and
instantiates several common Bregman manifolds used in information sciences. At
the core of the library is the notion of Legendre-Fenchel duality inducing a
canonical pair of dual potential functions and dual Bregman divergences. The
library also implements the Fisher-Rao manifolds of categorical/multinomial
distributions and multivariate normal distributions. To demonstrate the use of
the pyBregMan kernel manipulating those Bregman and Fisher-Rao manifolds, the
library also provides several core algorithms for various applications in
statistics, machine learning, information fusion, and so on.
comment: 28 pages
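As a concept illustration (generic NumPy, not pyBregMan's actual API), a Bregman divergence is generated by a strictly convex function F; with the negative-entropy generator it recovers the Kullback-Leibler divergence:

```python
import numpy as np

def bregman(F, gradF, p, q):
    """Bregman divergence B_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    return F(p) - F(q) - np.dot(gradF(q), p - q)

# Negative-entropy generator on the probability simplex.
def neg_entropy(x):
    return float(np.sum(x * np.log(x)))

def grad_neg_entropy(x):
    return np.log(x) + 1.0

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
# For distributions summing to 1, bregman(neg_entropy, grad_neg_entropy, p, q)
# equals KL(p || q).
```

The cumulant functions of exponential families and the characteristic functions of regular cones mentioned above play the role of F in the same way.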
☆ MultiColor: Image Colorization by Learning from Multiple Color Spaces
Deep networks have shown impressive performance in image restoration tasks,
such as image colorization. However, we find that previous approaches rely on
the digital representation from a single color model with a specific mapping
function, a.k.a. a color space, during the colorization pipeline. In this
paper, we first investigate the modeling of different color spaces and find
that each exhibits distinctive characteristics with a unique distribution of
colors. The complementarity among multiple color spaces benefits the image
colorization task.
We present MultiColor, a new learning-based approach to automatically
colorize grayscale images that combines clues from multiple color spaces.
Specifically, we employ a set of dedicated colorization modules for individual
color spaces. Within each module, a transformer decoder is first employed to
refine color query embeddings and then a color mapper produces color channel
prediction using the embeddings and semantic features. With these predicted
color channels representing various color spaces, a complementary network is
designed to exploit the complementarity and generate pleasing and reasonable
colorized images. We conduct extensive experiments on real-world datasets, and
the results demonstrate superior performance over the state of the art.
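For concreteness, one alternative color space of the kind the abstract alludes to, BT.601 YCbCr, separates luminance from chroma; the standard conversion (sketched here independently of MultiColor) is:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """BT.601 RGB -> YCbCr on floats in [0, 1]: one luminance channel plus
    two chroma channels, giving a value distribution quite different from
    the three correlated channels of RGB."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 + (b - y) / 1.772
    cr = 0.5 + (r - y) / 1.402
    return np.stack([y, cb, cr], axis=-1)
```

A grayscale pixel maps to neutral chroma (Cb = Cr = 0.5), which is why chroma channels are natural prediction targets for colorization.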
☆ Rotation center identification based on geometric relationships for rotary motion deblurring
Non-blind rotary motion deblurring (RMD) aims to recover the latent clear
image from a rotary motion blurred (RMB) image. The rotation center is a
crucial input parameter in non-blind RMD methods. Existing methods directly
estimate the rotation center from the RMB image. However, they always suffer
from significant errors, which limits RMD performance. For assembled
imaging systems, the position of the rotation center remains fixed. Leveraging
this prior knowledge, we propose a geometric-based method for rotation center
identification and analyze its error range. Furthermore, we construct an RMB
imaging system. The experiment demonstrates that our method achieves less than
1-pixel error along a single axis (x-axis or y-axis). We utilize the
constructed imaging system to capture real RMB images, and experimental results
show that our method helps existing RMD approaches yield better deblurred images.
☆ M2EF-NNs: Multimodal Multi-instance Evidence Fusion Neural Networks for Cancer Survival Prediction
Accurate cancer survival prediction is crucial for assisting clinical doctors
in formulating treatment plans. Multimodal data, including histopathological
images and genomic data, offer complementary and comprehensive information that
can greatly enhance the accuracy of this task. However, the current methods,
despite yielding promising results, suffer from two notable limitations: they
do not effectively utilize global context and disregard modal uncertainty. In
this study, we put forward a neural network model called M2EF-NNs, which
leverages multimodal and multi-instance evidence fusion techniques for accurate
cancer survival prediction. Specifically, to capture global information in the
images, we use a pre-trained Vision Transformer (ViT) model to obtain patch
feature embeddings of histopathological images. Then, we introduce a multimodal
attention module that uses genomic embeddings as queries and learns the
co-attention mapping between genomic and histopathological images to achieve an
early interaction fusion of multimodal information and better capture their
correlations. Subsequently, we are the first to apply the Dempster-Shafer
evidence theory (DST) to cancer survival prediction. We parameterize the
distribution of class probabilities using the processed multimodal features and
introduce subjective logic to estimate the uncertainty associated with
different modalities. By combining with the Dempster-Shafer theory, we can
dynamically adjust the weights of class probabilities after multimodal fusion
to achieve trusted survival prediction. Finally, experimental validation on the
TCGA datasets confirms the significant improvements achieved by our proposed
method in cancer survival prediction and enhances the reliability of the model.
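The paper combines subjective-logic opinions with Dempster-Shafer theory; as a minimal illustration, the reduced Dempster combination rule commonly used in evidential deep learning (per-class beliefs plus one overall uncertainty mass per modality) can be sketched as:

```python
import numpy as np

def ds_combine(b1, u1, b2, u2):
    """Dempster's rule for two evidence sources, each giving singleton
    beliefs b over the classes and an uncertainty mass u, with
    b.sum() + u == 1 per source."""
    # Conflict: mass assigned to pairs of disagreeing singleton classes.
    conflict = np.sum(np.outer(b1, b2)) - np.sum(b1 * b2)
    scale = 1.0 / (1.0 - conflict)
    b = scale * (b1 * b2 + b1 * u2 + b2 * u1)  # agreement + one-sided evidence
    u = scale * (u1 * u2)                      # both sources uncertain
    return b, u
```

A modality with high uncertainty u contributes little to the fused beliefs, which is the mechanism behind dynamically adjusting class-probability weights after fusion.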
☆ Efficient Single Image Super-Resolution with Entropy Attention and Receptive Field Augmentation ACM MM 2024
Xiaole Zhao, Linze Li, Chengxing Xie, Xiaoming Zhang, Ting Jiang, Wenjie Lin, Shuaicheng Liu, Tianrui Li
Transformer-based deep models for single image super-resolution (SISR) have
greatly improved the performance of lightweight SISR tasks in recent years.
However, they often suffer from heavy computational burden and slow inference
due to the complex calculation of multi-head self-attention (MSA), seriously
hindering their practical application and deployment. In this work, we present
an efficient SR model to mitigate the dilemma between model efficiency and SR
performance, dubbed Entropy Attention and Receptive Field Augmentation
network (EARFA), which is composed of a novel entropy attention (EA) and a shifting
large kernel attention (SLKA). From the perspective of information theory, EA
increases the entropy of intermediate features conditioned on a Gaussian
distribution, providing more informative input for subsequent reasoning. On the
other hand, SLKA extends the receptive field of SR models with the assistance
of channel shifting, which also helps boost the diversity of hierarchical
features. Since the implementation of EA and SLKA does not involve complex
computations (such as extensive matrix multiplications), the proposed method
can achieve faster nonlinear inference than Transformer-based SR models while
maintaining better SR performance. Extensive experiments show that the proposed
model can significantly reduce the delay of model inference while achieving the
SR performance comparable with other advanced models.
comment: Accepted to ACM MM 2024
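For intuition behind conditioning on a Gaussian distribution: the differential entropy of a Gaussian grows monotonically with its variance, so spreading intermediate features out makes them more informative. A minimal sketch (generic, not the EARFA code):

```python
import numpy as np

def gaussian_entropy(var):
    """Differential entropy of a 1-D Gaussian with variance var:
    H = 0.5 * log(2 * pi * e * var)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * var)
```

Increasing feature variance (e.g., var 1.0 -> 2.0) strictly increases this entropy, which is the quantity EA is designed to raise.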
♻ ☆ Advancing Prompt Learning through an External Layer
Prompt learning represents a promising method for adapting pre-trained
vision-language models (VLMs) to various downstream tasks by learning a set of
text embeddings. One challenge inherent to these methods is poor
generalization caused by learned text embeddings that are invalid for unseen
tasks. A straightforward approach to bridge this gap is to freeze
the text embeddings in prompts, which results in a lack of capacity to adapt
VLMs for downstream tasks. To address this dilemma, we propose a paradigm
called EnPrompt with a novel External Layer (EnLa). Specifically, we propose a
textual external layer and learnable visual embeddings for adapting VLMs to
downstream tasks. The learnable external layer is built upon valid embeddings
of pre-trained CLIP. This design considers the balance of learning capabilities
between the two branches. To align the textual and visual features, we propose
a novel two-pronged approach: i) we introduce the optimal transport as the
discrepancy metric to align the vision and text modalities, and ii) we
introduce a novel strengthening feature to enhance the interaction between
these two modalities. Four representative experiments (i.e., base-to-novel
generalization, few-shot learning, cross-dataset generalization, domain shifts
generalization) across 15 datasets demonstrate that our method outperforms the
existing prompt learning methods.
♻ ☆ Dual-View Data Hallucination with Semantic Relation Guidance for Few-Shot Image Recognition
Learning to recognize novel concepts from just a few image samples is very
challenging as the learned model is easily overfitted on the few data and
results in poor generalizability. One promising but underexplored solution is
to compensate for the novel classes by generating plausible samples. However, most
existing works of this line exploit visual information only, rendering the
generated data easily distracted by some challenging factors contained in
the few available samples. Being aware of the semantic information in the
textual modality that reflects human concepts, this work proposes a novel
framework that exploits semantic relations to guide dual-view data
hallucination for few-shot image recognition. The proposed framework enables
generating more diverse and reasonable data samples for novel classes through
effective information transfer from base classes. Specifically, an
instance-view data hallucination module hallucinates each sample of a novel
class to generate new data by employing local semantic correlated attention and
global semantic feature fusion derived from base classes. Meanwhile, a
prototype-view data hallucination module exploits semantic-aware measure to
estimate the prototype of a novel class and the associated distribution from
the few samples, which thereby harvests the prototype as a more stable sample
and enables resampling a large number of samples. We conduct extensive
experiments and comparisons with state-of-the-art methods on several popular
few-shot benchmarks to verify the effectiveness of the proposed framework.
comment: Accepted by IEEE Transactions on Multimedia
♻ ☆ TPA3D: Triplane Attention for Fast Text-to-3D Generation ECCV2024
Due to the lack of large-scale text-3D correspondence data, recent text-to-3D
generation works mainly rely on utilizing 2D diffusion models for synthesizing
3D data. Since diffusion-based methods typically require significant
optimization time for both training and inference, the use of GAN-based models
would still be desirable for fast 3D generation. In this work, we propose
Triplane Attention for text-guided 3D generation (TPA3D), an end-to-end
trainable GAN-based deep learning model for fast text-to-3D generation. With
only 3D shape data and their rendered 2D images observed during training, our
TPA3D is designed to retrieve detailed visual descriptions for synthesizing the
corresponding 3D mesh data. This is achieved by the proposed attention
mechanisms on the extracted sentence and word-level text features. In our
experiments, we show that TPA3D generates high-quality 3D textured shapes
aligned with fine-grained descriptions, while impressive computation efficiency
can be observed.
comment: ECCV2024
♻ ☆ Loss Functions and Metrics in Deep Learning
Juan Terven, Diana M. Cordova-Esparza, Alfonso Ramirez-Pedraza, Edgar A. Chavez-Urbiola, Julio A. Romero-Gonzalez
When training or evaluating deep learning models, two essential parts are
picking the proper loss function and deciding on performance metrics. In this
paper, we provide a comprehensive overview of the most common loss functions
and metrics used across many different types of deep learning tasks, from
general tasks such as regression and classification to more specific tasks in
Computer Vision and Natural Language Processing. We introduce the formula for
each loss and metric, discuss their strengths and limitations, and describe how
these methods can be applied to various problems within deep learning. We hope
this work serves as a reference for researchers and practitioners in the field,
helping them make informed decisions when selecting the most appropriate loss
function and performance metrics for their deep learning projects.
comment: 76 pages, 4 figures, 13 tables, 127 equations
♻ ☆ ESP-MedSAM: Efficient Self-Prompting SAM for Universal Domain-Generalized Image Segmentation
Qing Xu, Jiaxuan Li, Xiangjian He, Ziyu Liu, Zhen Chen, Wenting Duan, Chenxin Li, Maggie M. He, Fiseha B. Tesema, Wooi P. Cheah, Yi Wang, Rong Qu, Jonathan M. Garibaldi
The universality of deep neural networks across different modalities and
their generalization capabilities to unseen domains play an essential role in
medical image segmentation. The recent Segment Anything Model (SAM) has
demonstrated its potential in both settings. However, the huge computational
costs, demand for manual annotations as prompts and conflict-prone decoding
process of SAM degrade its generalizability and applicability in clinical
scenarios. To address these issues, we propose an efficient self-prompting SAM
for universal domain-generalized medical image segmentation, named ESP-MedSAM.
Specifically, we first devise the Multi-Modal Decoupled Knowledge Distillation
(MMDKD) strategy to construct a lightweight semi-parameter sharing image
encoder that produces discriminative visual features for diverse modalities.
Further, we introduce the Self-Patch Prompt Generator (SPPG) to automatically
generate high-quality dense prompt embeddings for guiding segmentation
decoding. Finally, we design the Query-Decoupled Modality Decoder (QDMD) that
leverages a one-to-one strategy to provide an independent decoding channel for
every modality. Extensive experiments indicate that ESP-MedSAM outperforms
state-of-the-art methods in diverse medical image segmentation tasks, displaying
superior modality universality and generalization capabilities. Especially,
ESP-MedSAM uses only 4.5\% parameters compared to SAM-H. The source code is
available at https://github.com/xq141839/ESP-MedSAM.
comment: Under Review
♻ ☆ Study of detecting behavioral signatures within DeepFake videos
There is strong interest in the generation of synthetic video imagery of
people talking for various purposes, including entertainment, communication,
training, and advertisement. With the development of deep fake generation
models, synthetic video imagery will soon be visually indistinguishable to the
naked eye from a naturally captured video. In addition, many methods continue
to improve, evading more careful forensic visual analysis. Some
deep fake videos are produced through the use of facial puppetry, which
directly controls the head and face of the synthetic image through the
movements of the actor, allowing the actor to 'puppet' the image of another. In
this paper, we address the question of whether one person's movements can be
distinguished from the original speaker by controlling the visual appearance of
the speaker but transferring the behavior signals from another source. We
conduct a study by comparing synthetic imagery that: 1) originates from a
different person speaking a different utterance, 2) originates from the same
person speaking a different utterance, and 3) originates from a different
person speaking the same utterance. Our study shows that synthetic videos in
all three cases are seen as less real and less engaging than the original
source video. Our results indicate that there could be a behavioral signature
that is detectable from a person's movements that is separate from their visual
appearance, and that this behavioral signature could be used to distinguish a
deep fake from a properly captured video.
comment: 9 pages
♻ ☆ Long and Short Guidance in Score identity Distillation for One-Step Text-to-Image Generation
Diffusion-based text-to-image generation models trained on extensive
text-image pairs have shown the capacity to generate photorealistic images
consistent with textual descriptions. However, a significant limitation of
these models is their slow sample generation, which requires iterative
refinement through the same network. In this paper, we enhance Score identity
Distillation (SiD) by developing long and short classifier-free guidance (LSG)
to efficiently distill pretrained Stable Diffusion models without using real
training data. SiD aims to optimize a model-based explicit score matching loss,
utilizing a score-identity-based approximation alongside the proposed LSG for
practical computation. By training exclusively with fake images synthesized
with its one-step generator, SiD equipped with LSG rapidly improves FID and
CLIP scores, achieving state-of-the-art FID performance while maintaining a
competitive CLIP score. Specifically, its data-free distillation of Stable
Diffusion 1.5 achieves a record low FID of 8.15 on the COCO-2014 validation
set, with a CLIP score of 0.304 at an LSG scale of 1.5, and an FID of 9.56 with
a CLIP score of 0.313 at an LSG scale of 2. Our code and distilled one-step
text-to-image generators are available at
https://github.com/mingyuanzhou/SiD-LSG.
comment: Code and model checkpoints available at
https://github.com/mingyuanzhou/SiD-LSG
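The long and short guidance (LSG) proposed here builds on standard classifier-free guidance; the vanilla rule it extends can be sketched as follows (LSG itself is specific to the paper and not reproduced here):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    score/noise prediction toward the text-conditional one by a guidance
    scale (scale=0 -> unconditional, scale=1 -> conditional)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

The LSG scales of 1.5 and 2 quoted above play the role of `scale` here, trading FID against CLIP score.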
♻ ☆ Smooth Deep Saliency
In this work, we investigate methods to reduce the noise in deep saliency
maps coming from convolutional downsampling. Those methods make the
investigated models more interpretable for gradient-based saliency maps,
computed in hidden layers. We evaluate the faithfulness of those methods using
insertion and deletion metrics, finding that saliency maps computed in hidden
layers perform better compared to both the input layer and GradCAM. We test our
approach on different models trained for image classification on ImageNet1K,
and models trained for tumor detection on Camelyon16 and in-house real-world
digital pathology scans of stained tissue samples. Our results show that the
checkerboard noise in the gradient gets reduced, resulting in smoother and
therefore easier to interpret saliency maps.
♻ ☆ FOOL: Addressing the Downlink Bottleneck in Satellite Computing with Neural Feature Compression
Alireza Furutanpey, Qiyang Zhang, Philipp Raith, Tobias Pfandzelter, Shangguang Wang, Schahram Dustdar
Nanosatellite constellations equipped with sensors capturing large geographic
regions provide unprecedented opportunities for Earth observation. As
constellation sizes increase, network contention poses a downlink bottleneck.
Orbital Edge Computing (OEC) leverages limited onboard compute resources to
reduce transfer costs by processing the raw captures at the source. However,
current solutions have limited practicability due to reliance on crude
filtering methods or over-prioritizing particular downstream tasks.
This work presents FOOL, an OEC-native and task-agnostic feature compression
method that preserves prediction performance. FOOL partitions high-resolution
satellite imagery to maximize throughput. Further, it embeds context and
leverages inter-tile dependencies to lower transfer costs with negligible
overhead. While FOOL is a feature compressor, it can recover images with
competitive scores on quality measures at lower bitrates. We extensively
evaluate transfer cost reduction by including the peculiarity of intermittently
available network connections in low earth orbit. Lastly, we test the
feasibility of our system for standardized nanosatellite form factors. We
demonstrate that FOOL permits downlinking over 100x the data volume without
relying on prior information on the downstream tasks.
comment: 18 pages, double column, 19 figures, 7 tables, Revision 1
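The tile partitioning described above can be sketched generically (a hypothetical helper, not FOOL's implementation):

```python
import numpy as np

def partition_tiles(img, t):
    """Split an (H, W, C) image into non-overlapping t x t tiles, returning
    an (N, t, t, C) batch; H and W are assumed divisible by t."""
    H, W, C = img.shape
    tiles = img.reshape(H // t, t, W // t, t, C).swapaxes(1, 2)
    return tiles.reshape(-1, t, t, C)
```

Batching tiles this way lets an onboard compressor keep the accelerator saturated regardless of the raw capture resolution.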
♻ ☆ Color Mismatches in Stereoscopic Video: Real-World Dataset and Deep Correction Method
Stereoscopic videos can contain color mismatches between the left and right
views due to minor variations in camera settings, lenses, and even object
reflections captured from different positions. The presence of color mismatches
can lead to viewer discomfort and headaches. This problem can be solved by
transferring color between stereoscopic views, but traditional methods often
lack quality, while neural-network-based methods can easily overfit on
artificial data. The scarcity of stereoscopic videos with real-world color
mismatches hinders the evaluation of different methods' performance. Therefore,
we filmed a video dataset, which includes both distorted frames with color
mismatches and ground-truth data, using a beam-splitter. Our second
contribution is a deep multiscale neural network that solves the
color-mismatch-correction task by leveraging stereo correspondences. The
experimental results demonstrate the effectiveness of the proposed method on a
conventional dataset, but there remains room for improvement on challenging
real-world data.
comment: The code and datasets are at
https://github.com/egorchistov/color-transfer/
♻ ☆ Unsupervised Mastoidectomy for Cochlear CT Mesh Reconstruction Using Highly Noisy Data
Cochlear Implant (CI) procedures involve inserting an array of electrodes
into the cochlea located inside the inner ear. Mastoidectomy is a surgical
procedure that uses a high-speed drill to remove part of the mastoid region of
the temporal bone, providing safe access to the cochlea through the middle and
inner ear. We aim to develop an intraoperative navigation system that registers
plans created using 3D preoperative Computerized Tomography (CT) volumes with
the 2D surgical microscope view. Herein, we propose a method to synthesize the
mastoidectomy volume using only the preoperative CT scan, where the mastoid is
intact. We introduce an unsupervised learning framework designed to synthesize
the mastoidectomy region. For model training purposes, this method uses postoperative CT
scans to avoid manual data cleaning or labeling, even when the region removed
during mastoidectomy is visible but affected by metal artifacts, low
signal-to-noise ratio, or electrode wiring. Our approach estimates
mastoidectomy regions with a mean dice score of 70.0%. This approach represents
a major step forward for CI intraoperative navigation by predicting realistic
mastoidectomy-removed regions in preoperative planning that can be used to
register the pre-surgery plan to intraoperative microscopy.
♻ ☆ Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey
The recent enthusiasm for open-world vision systems shows the community's
strong interest in performing perception tasks outside the closed-vocabulary
benchmark setups that have been so popular until now. Being able to discover
objects in images/videos without knowing in advance what objects populate the
dataset is an exciting prospect. But how to find objects without knowing
anything about them? Recent works show that it is possible to perform
class-agnostic unsupervised object localization by exploiting self-supervised
pre-trained features. We propose here a survey of unsupervised object
localization methods that discover objects in images without requiring any
manual annotation in the era of self-supervised ViTs. We gather links of
discussed methods in the repository
https://github.com/valeoai/Awesome-Unsupervised-Object-Localization.
comment: IJCV 2024
♻ ☆ MS-Twins: Multi-Scale Deep Self-Attention Networks for Medical Image Segmentation
Chest X-ray is one of the most common radiological examination types for the
diagnosis of chest diseases. Nowadays, the automatic classification technology
of radiological images has been widely used in clinical diagnosis and treatment
plans. However, each disease has its own characteristic response region and
receptive field, which is the main challenge for chest disease
classification tasks. Besides, the imbalance of sample data categories further
increases the difficulty of tasks. To solve these problems, we propose a new
multi-label chest disease image classification scheme based on a multi-scale
attention network. In this scheme, multi-scale information is iteratively fused
to focus on regions with a high probability of disease, to effectively mine
more meaningful information from data, and the classification performance can
be improved only by image level annotation. We also designed a new loss
function to improve the rationality of visual perception and the performance of
multi-label image classification by forcing the consistency of attention
regions before and after image transformation. A comprehensive experiment was
carried out on the public Chest X-Ray14 and CheXpert datasets, achieving
state-of-the-art results that verify the effectiveness of this method in chest
X-ray image classification.
♻ ☆ GMISeg: General Medical Image Segmentation without Re-Training
Online shopping behavior is characterized by rich granularity and data
sparsity, and previous research on user behavior prediction has not seriously
discussed feature selection and ensemble design. In this paper,
we proposed a SE-Stacking model based on information fusion and ensemble
learning for user purchase behavior prediction. After successfully utilizing
the ensemble feature selection method to screen purchase-related factors, we
used the Stacking algorithm for user purchase behavior prediction. In our
efforts to avoid the deviation of prediction results, we optimized the model by
selecting ten different kinds of models as base learners and modifying relevant
parameters specifically for them. The experiments conducted on a
publicly available dataset show that the SE-Stacking model can achieve a
98.40% F1-score, about 0.09% higher than the optimal base models. The
SE-Stacking model not only has a good application in the prediction of user
purchase behavior but also has practical value when combined with actual
e-commerce scenarios. At the same time, it has important significance for academic
research and the development of this field.
♻ ☆ 3DSS-Mamba: 3D-Spectral-Spatial Mamba for Hyperspectral Image Classification
Hyperspectral image (HSI) classification constitutes the fundamental research
in remote sensing fields. Convolutional Neural Networks (CNNs) and Transformers
have demonstrated impressive capability in capturing spectral-spatial
contextual dependencies. However, these architectures suffer from limited
receptive fields and quadratic computational complexity, respectively.
Fortunately, recent Mamba architectures built upon the State Space Model
integrate the advantages of long-range sequence modeling and linear
computational efficiency, exhibiting substantial potential in low-dimensional
scenarios. Motivated by this, we propose a novel 3D-Spectral-Spatial Mamba
(3DSS-Mamba) framework for HSI classification, allowing for global
spectral-spatial relationship modeling with greater computational efficiency.
Technically, a spectral-spatial token generation (SSTG) module is designed to
convert the HSI cube into a set of 3D spectral-spatial tokens. To overcome the
limitations of traditional Mamba, which is confined to modeling causal
sequences and inadaptable to high-dimensional scenarios, a 3D-Spectral-Spatial
Selective Scanning (3DSS) mechanism is introduced, which performs pixel-wise
selective scanning on 3D hyperspectral tokens along the spectral and spatial
dimensions. Five scanning routes are constructed to investigate the impact of
dimension prioritization. The 3DSS scanning mechanism combined with
conventional mapping operations forms the 3D-spectral-spatial mamba block
(3DMB), enabling the extraction of global spectral-spatial semantic
representations. Experimental results and analysis demonstrate that the
proposed method outperforms the state-of-the-art methods on HSI classification
benchmarks.
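The token-generation step described above can be pictured with a toy sketch: an H x W x B hyperspectral cube is split into non-overlapping 3D sub-cubes that act as spectral-spatial tokens. The function name, patch sizes, and the spatial-then-spectral scan order below are illustrative assumptions, not the authors' SSTG implementation.

```python
# Hypothetical sketch of an SSTG-like step: split an H x W x B hyperspectral
# cube (nested lists) into non-overlapping 3D sub-cube tokens. Sizes and scan
# order are assumptions for illustration only.

def make_3d_tokens(cube, ps, bs):
    """cube: nested list [H][W][B]; ps: spatial patch size; bs: spectral group size.
    Returns a flat list of (ps x ps x bs) sub-cubes, scanned spatially then spectrally."""
    H, W, B = len(cube), len(cube[0]), len(cube[0][0])
    tokens = []
    for i in range(0, H - ps + 1, ps):          # spatial rows
        for j in range(0, W - ps + 1, ps):      # spatial cols
            for b in range(0, B - bs + 1, bs):  # spectral bands
                token = [[cube[i + di][j + dj][b:b + bs]
                          for dj in range(ps)] for di in range(ps)]
                tokens.append(token)
    return tokens

# Toy cube: 4 x 4 pixels, 6 spectral bands
cube = [[[r * 100 + c * 10 + k for k in range(6)] for c in range(4)] for r in range(4)]
tokens = make_3d_tokens(cube, ps=2, bs=3)
# 2x2 spatial grid x 2 spectral groups -> 8 tokens, each 2x2x3
```

The five scanning routes mentioned in the abstract would then correspond to different orderings of these tokens along the spectral and spatial dimensions.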
♻ ☆ RRWNet: Recursive Refinement Network for Effective Retinal Artery/Vein Segmentation and Classification
The caliber and configuration of retinal blood vessels serve as important
biomarkers for various diseases and medical conditions. A thorough analysis of
the retinal vasculature requires the segmentation of the blood vessels and
their classification into arteries and veins, typically performed on color
fundus images obtained by retinography. However, manually performing these
tasks is labor-intensive and prone to human error. While several automated
methods have been proposed to address this task, the current state of the art faces
challenges due to manifest classification errors affecting the topological
consistency of segmentation maps. In this work, we introduce RRWNet, a novel
end-to-end deep learning framework that addresses this limitation. The
framework consists of a fully convolutional neural network that recursively
refines semantic segmentation maps, correcting manifest classification errors
and thus improving topological consistency. In particular, RRWNet is composed
of two specialized subnetworks: a Base subnetwork that generates base
segmentation maps from the input images, and a Recursive Refinement subnetwork
that iteratively and recursively improves these maps. Evaluation on three
different public datasets demonstrates the state-of-the-art performance of the
proposed method, yielding more topologically consistent segmentation maps with
fewer manifest classification errors than existing approaches. In addition, the
Recursive Refinement module within RRWNet proves effective in post-processing
segmentation maps from other methods, further demonstrating its potential. The
model code, weights, and predictions will be publicly available at
https://github.com/j-morano/rrwnet.
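The recursive-refinement idea described above can be sketched in a few lines: a base model produces an initial segmentation map, and a refinement function is then applied to its own output for a fixed number of iterations. The stand-in functions below are toys for illustration, not the actual RRWNet subnetworks.

```python
# Minimal sketch of recursive refinement: a Base stand-in produces an initial
# map, then a Refinement stand-in is applied to its own output K times.
# Both functions here are toy placeholders, not RRWNet's networks.

def refine_recursively(base_fn, refine_fn, image, iterations=3):
    seg = base_fn(image)              # Base subnetwork: initial segmentation map
    for _ in range(iterations):       # Recursive Refinement subnetwork
        seg = refine_fn(seg)          # each pass corrects the previous map
    return seg

# Toy stand-ins: base thresholds the input, refinement clamps values to [0, 1].
base = lambda xs: [1.0 if x > 0.5 else 0.0 for x in xs]
refine = lambda seg: [min(1.0, max(0.0, s)) for s in seg]

out = refine_recursively(base, refine, [0.2, 0.7, 0.9])
# -> [0.0, 1.0, 1.0]
```

The same loop structure explains why the Refinement module can also post-process maps produced by other methods, as the abstract notes.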
♻ ☆ RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network
Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Junwei Zhu, Xiaobin Hu, Donghao Luo, Yanhao Ge, Chengjie Wang
Person-generic audio-driven face generation is a challenging task in computer
vision. Previous methods have achieved remarkable progress in audio-visual
synchronization, but there is still a significant gap between current results
and practical applications. The challenges are two-fold: 1) Preserving unique
individual traits for achieving high-precision lip synchronization. 2)
Generating high-quality facial renderings in real-time performance. In this
paper, we propose a novel generalized audio-driven framework RealTalk, which
consists of an audio-to-expression transformer and a high-fidelity
expression-to-face renderer. In the first component, we consider both identity
and intra-personal variation features related to speaking lip movements. By
incorporating cross-modal attention on the enriched facial priors, we can
effectively align lip movements with audio, thus attaining greater precision in
expression prediction. In the second component, we design a lightweight facial
identity alignment (FIA) module which includes a lip-shape control structure
and a face texture reference structure. This novel design allows us to generate
fine details in real-time, without depending on sophisticated and inefficient
feature alignment modules. Our experimental results, both quantitative and
qualitative, on public datasets demonstrate the clear advantages of our method
in terms of lip-speech synchronization and generation quality. Furthermore, our
method is efficient and requires fewer computational resources, making it
well-suited to meet the needs of practical applications.
♻ ☆ P2LHAP:Wearable sensor-based human activity recognition, segmentation and forecast through Patch-to-Label Seq2Seq Transformer
Traditional deep learning methods struggle to simultaneously segment,
recognize, and forecast human activities from sensor data. This limits their
usefulness in many fields such as healthcare and assisted living, where
real-time understanding of ongoing and upcoming activities is crucial. This
paper introduces P2LHAP, a novel Patch-to-Label Seq2Seq framework that tackles
all three tasks in an efficient single-task model. P2LHAP divides sensor data
streams into a sequence of "patches", which serve as input tokens, and outputs
a sequence of patch-level activity labels, including predicted future
activities. A unique smoothing technique based on surrounding patch labels is
proposed to identify activity boundaries accurately. Additionally, P2LHAP
learns patch-level representations with channel-independent Transformer
encoders and decoders over the sensor signals. All channels share embedding and Transformer
weights across all sequences. Evaluated on three public datasets, P2LHAP
significantly outperforms the state-of-the-art in all three tasks,
demonstrating its effectiveness and potential for real-world applications.
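Two of the steps above lend themselves to a short sketch: splitting a per-channel sensor stream into fixed-length patch tokens, and smoothing patch-level labels with a vote over surrounding patches. The patch length and the majority-vote rule are assumptions for illustration; the paper's exact smoothing technique may differ.

```python
# Hedged sketch of P2LHAP-style patch tokenization and label smoothing.
# Patch length and the window-vote rule below are illustrative assumptions.

def to_patches(stream, patch_len):
    """Split one channel's signal into non-overlapping patches (input tokens)."""
    n = len(stream) // patch_len
    return [stream[i * patch_len:(i + 1) * patch_len] for i in range(n)]

def smooth_labels(labels, k=1):
    """Majority vote over a window of surrounding patch labels; a simple
    stand-in for the boundary-aware smoothing described in the abstract."""
    out = []
    for i in range(len(labels)):
        window = labels[max(0, i - k): i + k + 1]
        out.append(max(set(window), key=window.count))
    return out

signal = list(range(12))                  # one sensor channel, 12 samples
patches = to_patches(signal, 4)           # 3 patch tokens of length 4
smoothed = smooth_labels(["walk", "walk", "run", "walk", "walk"])
# the isolated "run" patch is voted away by its neighbors
```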
♻ ☆ GenAD: Generalized Predictive Model for Autonomous Driving CVPR 2024
Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, Jun Zhang, Andreas Geiger, Yu Qiao, Hongyang Li
In this paper, we introduce the first large-scale video prediction model in
the autonomous driving discipline. To eliminate the restriction of high-cost
data collection and empower the generalization ability of our model, we acquire
massive data from the web and pair it with diverse and high-quality text
descriptions. The resultant dataset accumulates over 2000 hours of driving
videos, spanning areas all over the world with diverse weather conditions and
traffic scenarios. Inheriting the merits from recent latent diffusion models,
our model, dubbed GenAD, handles the challenging dynamics in driving scenes
with novel temporal reasoning blocks. We showcase that it can generalize to
various unseen driving datasets in a zero-shot manner, surpassing general or
driving-specific video prediction counterparts. Furthermore, GenAD can be
adapted into an action-conditioned prediction model or a motion planner,
holding great potential for real-world driving applications.
comment: CVPR 2024 Highlight Paper. Dataset:
https://github.com/OpenDriveLab/DriveAGI
♻ ☆ SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses ACM MM 2024
Chaolei Tan, Zihang Lin, Junfu Pu, Zhongang Qi, Wei-Yi Pei, Zhi Qu, Yexin Wang, Ying Shan, Wei-Shi Zheng, Jian-Fang Hu
Video grounding is a fundamental problem in multimodal content understanding,
aiming to localize specific natural language queries in an untrimmed video.
However, current video grounding datasets merely focus on simple events and are
either limited to shorter videos or brief sentences, which hinders the model
from evolving toward stronger multimodal understanding capabilities. To address
these limitations, we present a large-scale video grounding dataset named
SynopGround, in which more than 2800 hours of videos are sourced from popular
TV dramas and are paired with accurately localized human-written synopses. Each
paragraph in the synopsis serves as a language query and is manually annotated
with precise temporal boundaries in the long video. These paragraph queries are
tightly correlated to each other and contain a wealth of abstract expressions
summarizing video storylines and specific descriptions portraying event
details, which enables the model to learn multimodal perception on more
intricate concepts over longer context dependencies. Based on the dataset, we
further introduce a more complex setting of video grounding dubbed
Multi-Paragraph Video Grounding (MPVG), which takes as input multiple
paragraphs and a long video for grounding each paragraph query to its temporal
interval. In addition, we propose a novel Local-Global Multimodal Reasoner
(LGMR) to explicitly model the local-global structures of long-term multimodal
inputs for MPVG. Our method provides an effective baseline solution to the
multi-paragraph video grounding problem. Extensive experiments verify the
proposed model's effectiveness as well as its superiority in long-term
multi-paragraph video grounding over prior state-of-the-arts. Dataset and code
are publicly available. Project page: https://synopground.github.io/.
comment: Accepted to ACM MM 2024. Project page: https://synopground.github.io/
♻ ☆ Self-supervised visual learning from interactions with objects
Self-supervised learning (SSL) has revolutionized visual representation
learning, but has not achieved the robustness of human vision. A reason for
this could be that SSL does not leverage all the data available to humans
during learning. When learning about an object, humans often purposefully turn
or move around objects and research suggests that these interactions can
substantially enhance their learning. Here we explore whether such
object-related actions can boost SSL. For this, we extract the actions
performed to change from one ego-centric view of an object to another in four
video datasets. We then introduce a new loss function to learn visual and
action embeddings by aligning the performed action with the representations of
two images extracted from the same clip. This permits the performed actions to
structure the latent visual representation. Our experiments show that our
method consistently outperforms previous methods on downstream category
recognition. In our analysis, we find that the observed improvement is
associated with a better viewpoint-wise alignment of different objects from the
same category. Overall, our work demonstrates that embodied interactions with
objects can improve SSL of object categories.
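One way to picture the alignment objective described above: encourage the embedding of the performed action to match the change between the representations of two views from the same clip. The difference-based formulation below is an assumption for illustration; the paper's actual loss may be defined differently.

```python
# Hedged toy version of an action-view alignment loss: penalize the squared
# error between the view-to-view representation change and the action
# embedding. This exact formulation is an assumption, not the paper's loss.

def alignment_loss(view_a, view_b, action_emb):
    """Mean squared error between (view_b - view_a) and the action embedding."""
    diff = [b - a for a, b in zip(view_a, view_b)]
    return sum((d - e) ** 2 for d, e in zip(diff, action_emb)) / len(diff)

loss = alignment_loss([0.0, 1.0], [1.0, 3.0], [1.0, 2.0])
# action embedding exactly explains the view change -> loss of 0.0
```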
♻ ☆ Fast and Accurate Object Detection on Asymmetrical Receptive Field
Object detection has been used in a wide range of industries. For example, in
autonomous driving, the task of object detection is to accurately and
efficiently identify and locate a large number of predefined classes of object
instances (vehicles, pedestrians, traffic signs, etc.) from videos of roads. In
robotics, industrial robots need to recognize specific machine elements. In
the security field, cameras must accurately recognize individual faces.
With the wide application of deep learning, the accuracy and efficiency of
object detection have been greatly improved, but object detection based on deep
learning still faces challenges. Different applications of object detection
have different requirements, including highly accurate detection,
multi-category object detection, real-time detection, robustness to occlusions,
etc. To address the above challenges, based on extensive literature research,
this paper analyzes methods for improving and optimizing mainstream object
detection algorithms from the perspective of evolution of one-stage and
two-stage object detection algorithms. Furthermore, this article proposes
methods for improving object detection accuracy from the perspective of
changing receptive fields. The new model is based on the original YOLOv5 (You
Only Look Once) with some modifications. The structure of the head part of
YOLOv5 is modified by adding asymmetrical pooling layers. As a result, the
accuracy of the algorithm is improved while its speed is preserved. The
performance of the new model is compared with that of the original YOLOv5
model and analyzed across several metrics, and the new model is evaluated in
four scenarios. Finally, the paper summarizes open problems and outlines
future research directions.
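The "asymmetrical pooling" mentioned above can be sketched as max pooling with a non-square kernel, which widens the receptive field in one direction only. The pure-Python toy below uses 1x3 and 3x1 kernels as illustrative assumptions; the modified YOLOv5 head may use different shapes.

```python
# Toy sketch of asymmetrical max pooling over a 2D feature map: a non-square
# kernel (1x3 or 3x1) extends the receptive field along one axis only.
# Kernel shapes are assumptions for illustration.

def max_pool(fmap, kh, kw):
    """Stride-1, no-padding max pooling with a kh x kw kernel."""
    H, W = len(fmap), len(fmap[0])
    return [[max(fmap[i + di][j + dj] for di in range(kh) for dj in range(kw))
             for j in range(W - kw + 1)] for i in range(H - kh + 1)]

fmap = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
horiz = max_pool(fmap, 1, 3)  # 1x3 kernel -> 3x1 output: [[3], [6], [9]]
vert  = max_pool(fmap, 3, 1)  # 3x1 kernel -> 1x3 output: [[7, 8, 9]]
```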
♻ ☆ HARMamba: Efficient and Lightweight Wearable Sensor Human Activity Recognition Based on Bidirectional Mamba
Wearable sensor-based human activity recognition (HAR) is a critical research
domain in activity perception. However, achieving high efficiency and long
sequence recognition remains a challenge. Despite the extensive investigation
of temporal deep learning models, such as CNNs, RNNs, and transformers, their
extensive parameters often pose significant computational and memory
constraints, rendering them less suitable for resource-constrained mobile
health applications. This study introduces HARMamba, an innovative light-weight
and versatile HAR architecture that combines selective bidirectional State
Spaces Model and hardware-aware design. To optimize real-time resource
consumption in practical scenarios, HARMamba employs linear recursive
mechanisms and parameter discretization, allowing it to selectively focus on
relevant input sequences while efficiently fusing scan and recompute
operations. The model employs independent channels to process sensor data
streams, dividing each channel into patches and appending classification tokens
to the end of the sequence. It utilizes position embedding to represent the
sequence order. The patch sequence is subsequently processed by HARMamba Block,
and the classification head finally outputs the activity category. The HARMamba
Block serves as the fundamental component of the HARMamba architecture,
enabling the effective capture of more discriminative activity sequence
features. HARMamba outperforms contemporary state-of-the-art frameworks,
delivering comparable or better accuracy while significantly reducing
computational and memory demands. Its effectiveness has been extensively
validated on four publicly available datasets, namely PAMAP2, WISDM, UNIMIB
SHAR, and UCI. The F1 scores of HARMamba on the four datasets are 99.74%,
99.20%, 88.23%, and 97.01%, respectively.
♻ ☆ Edit As You Wish: Video Caption Editing with Multi-grained User Control ACM MM 2024
Automatically narrating videos in natural language complying with user
requests, i.e. Controllable Video Captioning task, can help people manage
massive videos with desired intentions. However, existing works suffer from two
shortcomings: 1) the control signal is single-grained which can not satisfy
diverse user intentions; 2) the video description is generated in a single
round which can not be further edited to meet dynamic needs. In this paper, we
propose a novel \textbf{V}ideo \textbf{C}aption \textbf{E}diting \textbf{(VCE)}
task to automatically revise an existing video description guided by
multi-grained user requests. Inspired by human writing-revision habits, we
design the user command as a pivotal triplet \{\textit{operation, position,
attribute}\} to cover diverse user needs from coarse-grained to fine-grained.
To facilitate the VCE task, we \textit{automatically} construct an open-domain
benchmark dataset named VATEX-EDIT and \textit{manually} collect an e-commerce
dataset called EMMAD-EDIT. We further propose a specialized small-scale model
(i.e., OPA) compared with two generalist Large Multi-modal Models to perform an
exhaustive analysis of the novel task. For evaluation, we adopt comprehensive
metrics considering caption fluency, command-caption consistency, and
video-caption alignment. Experiments reveal the task challenges of fine-grained
multi-modal semantics understanding and processing. Our datasets, codes, and
evaluation tools are available at https://github.com/yaolinli/VCE.
comment: Accepted by ACM MM 2024
♻ ☆ View-Consistent 3D Editing with Gaussian Splatting ECCV 2024
The advent of 3D Gaussian Splatting (3DGS) has revolutionized 3D editing,
offering efficient, high-fidelity rendering and enabling precise local
manipulations. Currently, diffusion-based 2D editing models are harnessed to
modify multi-view rendered images, which then guide the editing of 3DGS models.
However, this approach faces a critical issue of multi-view inconsistency,
where the guidance images exhibit significant discrepancies across views,
leading to mode collapse and visual artifacts of 3DGS. To this end, we
introduce View-consistent Editing (VcEdit), a novel framework that seamlessly
incorporates 3DGS into image editing processes, ensuring multi-view consistency
in edited guidance images and effectively mitigating mode collapse issues.
VcEdit employs two innovative consistency modules: the Cross-attention
Consistency Module and the Editing Consistency Module, both designed to reduce
inconsistencies in edited images. By incorporating these consistency modules
into an iterative pattern, VcEdit proficiently resolves the issue of multi-view
inconsistency, facilitating high-quality 3DGS editing across a diverse range of
scenes. Further video results are shown in http://vcedit.github.io.
comment: accepted to ECCV 2024
♻ ☆ CCVA-FL: Cross-Client Variations Adaptive Federated Learning for Medical Imaging
Federated Learning (FL) offers a privacy-preserving approach to train models
on decentralized data. Its potential in healthcare is significant, but
challenges arise due to cross-client variations in medical image data,
exacerbated by limited annotations. This paper introduces Cross-Client
Variations Adaptive Federated Learning (CCVA-FL) to address these issues.
CCVA-FL aims to minimize cross-client variations by transforming images into a
common feature space. It involves expert annotation of a subset of images from
each client, followed by the selection of a client with the least data
complexity as the target. Synthetic medical images are then generated using
Scalable Diffusion Models with Transformers (DiT) based on the target client's
annotated images. These synthetic images, capturing diversity and representing
the original data, are shared with other clients. Each client then translates
its local images into the target image space using image-to-image translation.
The translated images are subsequently used in a federated learning setting to
develop a server model. Our results demonstrate that CCVA-FL outperforms
Vanilla Federated Averaging by effectively addressing data distribution
differences across clients without compromising privacy.
comment: I found critical errors in the manuscript affecting its validity. I
need to correct these before resubmitting. Major changes to methodology and
results are underway, significantly altering the content. I will resubmit the
revised version
♻ ☆ SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor
Xianfu Cheng, Weixiao Zhou, Xiang Li, Jian Yang, Hang Zhang, Tao Sun, Wei Zhang, Yuying Mai, Tongliang Li, Xiaoming Chen, Zhoujun Li
Scene Text Recognition (STR) is an important and challenging upstream task
for building structured information databases, which involves recognizing text
within images of natural scenes. Although current state-of-the-art (SOTA)
models for STR exhibit high performance, they typically suffer from low
inference efficiency due to their reliance on hybrid architectures comprised of
visual encoders and sequence decoders. In this work, we propose a VIsion
Permutable extractor for fast and efficient Scene Text Recognition (SVIPTR),
which achieves an impressive balance between high performance and rapid
inference speeds in the domain of STR. Specifically, SVIPTR leverages a
visual-semantic extractor with a pyramid structure, characterized by the
permutation and combination of local and global self-attention layers. This
design results in a lightweight and efficient model and its inference is
insensitive to input length. Extensive experimental results on various standard
datasets for both Chinese and English scene text recognition validate the
superiority of SVIPTR. Notably, the SVIPTR-T (Tiny) variant delivers highly
competitive accuracy on par with other lightweight models and achieves SOTA
inference speeds. Meanwhile, the SVIPTR-L (Large) attains SOTA accuracy in
single-encoder-type models, while maintaining a low parameter count and
favorable inference speed. Our proposed method provides a compelling solution
for the STR challenge, which greatly benefits real-world applications requiring
fast and efficient STR. The code is publicly available at
https://github.com/cxfyxl/VIPTR.
comment: 10 pages, 4 figures, 6 tables
♻ ☆ Sparse Multi-baseline SAR Cross-modal 3D Reconstruction of Vehicle Targets
Multi-baseline SAR 3D imaging faces significant challenges due to data
sparsity. In recent years, deep learning techniques have achieved notable
success in enhancing the quality of sparse SAR 3D imaging. However, previous
works typically rely on full-aperture high-resolution radar images to supervise
the training of deep neural networks (DNNs), utilizing only single-modal
information from radar data. Consequently, imaging performance is limited, and
acquiring full-aperture data for multi-baseline SAR is costly and sometimes
impractical in real-world applications. In this paper, we propose a Cross-Modal
Reconstruction Network (CMR-Net), which integrates differentiable rendering and
cross-modal supervision with optical images to reconstruct highly sparse
multi-baseline SAR 3D images of vehicle targets into visually structured and
high-resolution images. We meticulously designed the network architecture and
training strategies to enhance network generalization capability. Remarkably,
CMR-Net, trained solely on simulated data, demonstrates high-resolution
reconstruction capabilities on both publicly available simulation datasets and
real measured datasets, outperforming traditional sparse reconstruction
algorithms based on compressed sensing and other learning-based methods.
Additionally, using optical images as supervision provides a cost-effective way
to build training datasets, reducing the difficulty of method dissemination.
Our work showcases the broad prospects of deep learning in multi-baseline SAR
3D imaging and offers a novel path for researching radar imaging based on
cross-modal learning theory.
♻ ☆ DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models
Emotional talking head generation has attracted growing attention. Previous
methods, which are mainly GAN-based, still struggle to consistently produce
satisfactory results across diverse emotions and cannot conveniently specify
personalized emotions. In this work, we leverage powerful diffusion models to
address the issue and propose DreamTalk, a framework that employs meticulous
design to unlock the potential of diffusion models in generating emotional
talking heads. Specifically, DreamTalk consists of three crucial components: a
denoising network, a style-aware lip expert, and a style predictor. The
diffusion-based denoising network can consistently synthesize high-quality
audio-driven face motions across diverse emotions. To enhance lip-motion
accuracy and emotional fullness, we introduce a style-aware lip expert that can
guide lip-sync while preserving emotion intensity. To more conveniently specify
personalized emotions, a diffusion-based style predictor is utilized to predict
the personalized emotion directly from the audio, eliminating the need for
extra emotion reference. By this means, DreamTalk can consistently generate
vivid talking faces across diverse emotions and conveniently specify
personalized emotions. Extensive experiments validate DreamTalk's effectiveness
and superiority. The code is available at
https://github.com/ali-vilab/dreamtalk.
comment: Project Page: https://dreamtalk-project.github.io
♻ ☆ Harmonized Spatial and Spectral Learning for Robust and Generalized Medical Image Segmentation ICPR-2024
Deep learning has demonstrated remarkable achievements in medical image
segmentation. However, prevailing deep learning models struggle with poor
generalization due to (i) intra-class variations, where the same class appears
differently in different samples, and (ii) inter-class independence, resulting
in difficulties capturing intricate relationships between distinct objects,
leading to higher false negative cases. This paper presents a novel approach
that synergizes spatial and spectral representations to enhance
domain-generalized medical image segmentation. We introduce the innovative
Spectral Correlation Coefficient objective to improve the model's capacity to
capture middle-order features and contextual long-range dependencies. This
objective complements traditional spatial objectives by incorporating valuable
spectral information. Extensive experiments reveal that optimizing this
objective with existing architectures like UNet and TransUNet significantly
enhances generalization, interpretability, and noise robustness, producing more
confident predictions. For instance, in cardiac segmentation, we observe a 0.81
pp and 1.63 pp (pp = percentage point) improvement in DSC over UNet and
TransUNet, respectively. Our interpretability study demonstrates that, in most
tasks, objectives optimized with UNet outperform even TransUNet by introducing
global contextual information alongside local details. These findings
underscore the versatility and effectiveness of our proposed method across
diverse imaging modalities and medical domains.
comment: Early Accepted at ICPR-2024 for Oral Presentation
♻ ☆ Nighttime Pedestrian Detection Based on Fore-Background Contrast Learning
The significance of background information is frequently overlooked in
contemporary research concerning channel attention mechanisms. This study
addresses the issue of suboptimal single-spectral nighttime pedestrian
detection performance under low-light conditions by incorporating background
information into the channel attention mechanism. Despite numerous studies
focusing on the development of efficient channel attention mechanisms, the
relevance of background information has been largely disregarded. By adopting a
contrast learning approach, we reexamine channel attention with regard to
pedestrian objects and background information for nighttime pedestrian
detection, resulting in the proposed Fore-Background Contrast Attention (FBCA).
FBCA possesses two primary attributes: (1) channel descriptors form remote
dependencies with global spatial feature information; (2) the integration of
background information enhances the distinction between channels concentrating
on low-light pedestrian features and those focusing on background information.
Consequently, the acquired channel descriptors exhibit a higher semantic level
and spatial accuracy. Experimental outcomes demonstrate that FBCA significantly
outperforms existing methods in single-spectral nighttime pedestrian detection,
achieving state-of-the-art results on the NightOwls and TJU-DHD-pedestrian
datasets. Furthermore, this methodology also yields performance improvements
for the multispectral LLVIP dataset. These findings indicate that integrating
background information into the channel attention mechanism effectively
mitigates detector performance degradation caused by illumination factors in
nighttime scenarios.
♻ ☆ Rethinking Feature Backbone Fine-tuning for Remote Sensing Object Detection
Recently, numerous methods have achieved impressive performance in remote
sensing object detection, relying on convolution or transformer architectures.
Such detectors typically have a feature backbone to extract useful features
from raw input images. For the remote sensing domain, a common practice among
current detectors is to initialize the backbone with pre-training on ImageNet
consisting of natural scenes. Fine-tuning the backbone is then typically
required to generate features suitable for remote-sensing images. However, this
could hinder the extraction of basic visual features in long-term training,
thus restricting performance improvement. To mitigate this issue, we propose a
novel method named DBF (Dynamic Backbone Freezing) for feature backbone
fine-tuning on remote sensing object detection. Our method aims to handle the
dilemma of whether the backbone should extract low-level generic features or
possess specific knowledge of the remote sensing domain, by introducing a
module called 'Freezing Scheduler' to dynamically manage the update of backbone
features during training. Extensive experiments on DOTA and DIOR-R show that
our approach enables more accurate model learning while substantially reducing
computational costs. Our method can be seamlessly adopted without additional
effort due to its straightforward design.
comment: Under Review
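The "Freezing Scheduler" module described above amounts to a per-epoch policy deciding whether backbone parameters receive gradient updates. The alternating schedule in the sketch below is an illustrative assumption, not the paper's actual rule; in a real training loop the returned flag would toggle `requires_grad` on the backbone parameters.

```python
# Hedged sketch of a Freezing-Scheduler-style policy: decide per epoch whether
# the backbone is trainable. The alternating rule is an assumption; DBF's
# actual dynamic schedule may differ.

class FreezingScheduler:
    def __init__(self, freeze_every=2):
        self.freeze_every = freeze_every

    def backbone_trainable(self, epoch):
        # Freeze the backbone on every `freeze_every`-th epoch, otherwise
        # allow fine-tuning toward the remote sensing domain.
        return epoch % self.freeze_every != 0

sched = FreezingScheduler(freeze_every=2)
plan = [sched.backbone_trainable(e) for e in range(4)]
# -> [False, True, False, True]
```

Skipping backbone updates on frozen epochs is also what yields the reduced computational cost the abstract reports.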
♻ ☆ The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024
This paper delineates the visual speech recognition (VSR) system introduced
by the NPU-ASLP (Team 237) in the second Chinese Continuous Visual Speech
Recognition Challenge (CNVSRC 2024), engaging in all four tracks, including the
fixed and open tracks of Single-Speaker VSR Task and Multi-Speaker VSR Task. In
terms of data processing, we leverage the lip motion extractor from the
baseline to produce multiscale video data. Besides, various augmentation
techniques are applied during training, encompassing speed perturbation, random
rotation, horizontal flipping, and color transformation. The VSR model adopts
an end-to-end architecture with joint CTC/attention loss, introducing Enhanced
ResNet3D visual frontend, E-Branchformer encoder, and Bi-directional
Transformer decoder. Our approach yields a 30.47% CER for the Single-Speaker
Task and 34.30% CER for the Multi-Speaker Task, securing second place in the
open track of the Single-Speaker Task and first place in the other three
tracks.
comment: 2 pages, 2 figures, CNVSRC 2024 System Report
♻ ☆ WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation ECCV 2024
Generating multi-camera street-view videos is critical for augmenting
autonomous driving datasets, addressing the urgent demand for extensive and
varied data. Due to the limitations in diversity and challenges in handling
lighting conditions, traditional rendering-based methods are increasingly being
supplanted by diffusion-based methods. However, a significant challenge in
diffusion-based methods is ensuring that the generated sensor data preserve
both intra-world consistency and inter-sensor coherence. To address these
challenges, we incorporate an additional explicit world volume and propose the
World Volume-aware Multi-camera Driving Scene Generator (WoVoGen). This system
is specifically designed to leverage 4D world volume as a foundational element
for video generation. Our model operates in two distinct phases: (i)
envisioning the future 4D temporal world volume based on vehicle control
sequences, and (ii) generating multi-camera videos, informed by this envisioned
4D temporal world volume and sensor interconnectivity. The incorporation of the
4D world volume empowers WoVoGen not only to generate high-quality street-view
videos in response to vehicle control inputs but also to facilitate scene
editing tasks.
comment: ECCV 2024
♻ ☆ Recent Deep Semi-supervised Learning Approaches and Related Works
This work proposes an overview of the recent semi-supervised learning
approaches and related works. Despite the remarkable success of neural networks
in various applications, there exist a few formidable constraints, including
the need for a large amount of labeled data. Therefore, semi-supervised
learning, which is a learning scheme in which scarce labels and a larger amount
of unlabeled data are utilized to train models (e.g., deep neural networks), is
getting more important. Based on the key assumptions of semi-supervised
learning, which are the manifold assumption, cluster assumption, and continuity
assumption, the work reviews the recent semi-supervised learning approaches. In
particular, the methods in regard to using deep neural networks in a
semi-supervised learning setting are primarily discussed. In addition, the
existing works are first classified based on the underlying idea and explained,
then the holistic approaches that unify the aforementioned ideas are detailed.
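One widespread deep semi-supervised idea covered by such surveys is pseudo-labeling: predictions on unlabeled data are kept as training labels only when the model's confidence exceeds a threshold. The threshold value and toy probabilities below are assumptions for illustration.

```python
# Compact sketch of confidence-thresholded pseudo-labeling, a standard deep
# semi-supervised technique. Threshold and inputs are toy assumptions.

def pseudo_label(probs, threshold=0.9):
    """probs: list of per-example class-probability lists.
    Returns (example_index, argmax_class) pairs for confident predictions only."""
    selected = []
    for i, p in enumerate(probs):
        conf = max(p)
        if conf >= threshold:                 # keep only confident predictions
            selected.append((i, p.index(conf)))
    return selected

preds = [[0.95, 0.05], [0.55, 0.45], [0.08, 0.92]]
labels = pseudo_label(preds)
# -> [(0, 0), (2, 1)]; the uncertain middle example is left unlabeled
```

The threshold embodies the cluster assumption mentioned above: only points the model places firmly inside a cluster are trusted as labels.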
♻ ☆ 3D Structure-guided Network for Tooth Alignment in 2D Photograph BMVC 2023
Orthodontics focuses on rectifying misaligned teeth (i.e., malocclusions),
affecting both masticatory function and aesthetics. However, orthodontic
treatment often involves complex, lengthy procedures. As such, generating a 2D
photograph depicting aligned teeth prior to orthodontic treatment is crucial
for effective dentist-patient communication and, more importantly, for
encouraging patients to accept orthodontic intervention. In this paper, we
propose a 3D structure-guided tooth alignment network that takes 2D photographs
as input (e.g., photos captured by smartphones) and aligns the teeth within the
2D image space to generate an orthodontic comparison photograph featuring
aesthetically pleasing, aligned teeth. Notably, while the process operates
within a 2D image space, our method employs 3D intra-oral scanning models
collected in clinics to learn about orthodontic treatment, i.e., projecting the
pre- and post-orthodontic 3D tooth structures onto 2D tooth contours, followed
by a diffusion model to learn the mapping relationship. Ultimately, the aligned
tooth contours are leveraged to guide the generation of a 2D photograph with
aesthetically pleasing, aligned teeth and realistic textures. We evaluate our
network on various facial photographs, demonstrating its exceptional
performance and strong applicability within the orthodontic industry.
comment: Accepted by The 34th British Machine Vision Conference (BMVC 2023)
Our BMVC webpage is https://proceedings.bmvc2023.org/322/
♻ ☆ Perm: A Parametric Representation for Multi-Style 3D Hair Modeling
Chengan He, Xin Sun, Zhixin Shu, Fujun Luan, Sören Pirk, Jorge Alejandro Amador Herrera, Dominik L. Michels, Tuanfeng Y. Wang, Meng Zhang, Holly Rushmeier, Yi Zhou
We present Perm, a learned parametric model of human 3D hair designed to
facilitate various hair-related applications. Unlike previous work that jointly
models the global hair shape and local strand details, we propose to
disentangle them using a PCA-based strand representation in the frequency
domain, thereby allowing more precise editing and output control. Specifically,
we leverage our strand representation to fit and decompose hair geometry
textures into low- to high-frequency hair structures. These decomposed textures
are later parameterized with different generative models, emulating common
stages in the hair modeling process. We conduct extensive experiments to
validate the architecture design of Perm, and finally deploy the
trained model as a generic prior to solve task-agnostic problems, further
showcasing its flexibility and superiority in tasks such as 3D hair
parameterization, hairstyle interpolation, single-view hair reconstruction, and
hair-conditioned image generation. Our code, data, and supplemental can be
found at our project page: https://cs.yale.edu/homes/che/projects/perm/
comment: Project page: https://cs.yale.edu/homes/che/projects/perm/
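The frequency-domain, PCA-based strand representation at the heart of Perm can be illustrated with a minimal sketch (synthetic random-walk strands, a real FFT standing in for the paper's transform, and plain-NumPy PCA; none of the sizes or names below come from the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic hair strands: 200 strands, 64 points each, 3D coordinates.
strands = np.cumsum(rng.normal(size=(200, 64, 3)) * 0.01, axis=1)

# Move each strand to the frequency domain along its point axis.
freq = np.fft.rfft(strands, axis=1)                  # (200, 33, 3), complex
feats = np.concatenate([freq.real, freq.imag], axis=1).reshape(200, -1)

# PCA via SVD: a few components serve as the low-dimensional strand code.
mean = feats.mean(axis=0)
U, S, Vt = np.linalg.svd(feats - mean, full_matrices=False)
k = 10
codes = (feats - mean) @ Vt[:k].T                    # per-strand parameters

# Decode: codes -> frequency features -> strand geometry.
recon = codes @ Vt[:k] + mean
n = recon.shape[1] // 2
re, im = recon[:, :n], recon[:, n:]
freq_rec = (re + 1j * im).reshape(200, 33, 3)
strands_rec = np.fft.irfft(freq_rec, n=64, axis=1)

err = np.abs(strands_rec - strands).mean()
```

Editing a single PCA coefficient then changes one frequency band across the whole strand, which is the kind of disentangled low- vs. high-frequency control the abstract describes.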
♻ ☆ EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
In this work, we tackle the challenge of enhancing the realism and
expressiveness in talking head video generation by focusing on the dynamic and
nuanced relationship between audio cues and facial movements. We identify the
limitations of traditional techniques that often fail to capture the full
spectrum of human expressions and the uniqueness of individual facial styles.
To address these issues, we propose EMO, a novel framework that utilizes a
direct audio-to-video synthesis approach, bypassing the need for intermediate
3D models or facial landmarks. Our method ensures seamless frame transitions
and consistent identity preservation throughout the video, resulting in highly
expressive and lifelike animations. Experimental results demonstrate that EMO is
able to produce not only convincing speaking videos but also singing videos in
various styles, significantly outperforming existing state-of-the-art
methodologies in terms of expressiveness and realism.
♻ ☆ Compression-Realized Deep Structural Network for Video Quality Enhancement ACM MM'24
This paper focuses on the task of quality enhancement for compressed videos.
Although deep network-based video restorers achieve impressive progress, most
of the existing methods lack a structured design to optimally leverage the
priors within compression codecs. Since the quality degradation of the video is
primarily induced by the compression algorithm, a new paradigm is urgently
needed for a more "conscious" process of quality enhancement. As a result, we
propose the Compression-Realized Deep Structural Network (CRDS), introducing
three inductive biases aligned with the three primary processes in the classic
compression codec, merging the strengths of classical encoder architecture with
deep network capabilities. Inspired by the residual extraction and domain
transformation process in the codec, a pre-trained Latent Degradation Residual
Auto-Encoder is proposed to transform video frames into a latent feature space,
and the mutual neighborhood attention mechanism is integrated for precise
motion estimation and residual extraction. Furthermore, drawing inspiration
from the quantization noise distribution of the codec, CRDS proposes a novel
Progressive Denoising framework with intermediate supervision that decomposes
the quality enhancement into a series of simpler denoising sub-tasks.
Experimental results on datasets like LDV 2.0 and MFQE 2.0 indicate our
approach surpasses state-of-the-art models.
comment: Accepted by ACM MM'24
♻ ☆ Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation
Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel
classes with limited support examples. Existing methods based on Region
Proposal Networks (RPNs) face two issues: 1) Overfitting suppresses novel class
objects; 2) Dual-branch models require complex spatial correlation strategies
to prevent spatial information loss when generating class prototypes. We
introduce a unified framework, Reference Twice (RefT), to exploit the
relationship between support and query features for FSIS and related tasks. Our
three main contributions are: 1) A novel transformer-based baseline that avoids
overfitting, offering a new direction for FSIS; 2) Demonstrating that support
object queries encode key factors after base training, allowing query features
to be enhanced twice at both feature and query levels using simple
cross-attention, thus avoiding complex spatial correlation interaction; 3)
Introducing a class-enhanced base knowledge distillation loss to address the
issue of DETR-like models struggling with incremental settings due to the input
projection layer, enabling easy extension to incremental FSIS. Extensive
experimental evaluations on the COCO dataset under three FSIS settings
demonstrate that our method performs favorably against existing approaches
across different shots, e.g., $+8.2/+9.4$ performance gain over state-of-the-art
methods with 10/30-shots. Source code and models will be available at
https://github.com/hanyue1648/RefT.
comment: Accepted by T-PAMI
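The "reference twice" enhancement in contribution 2) can be illustrated with a dependency-free sketch (single-head attention without learned projections; all shapes and variable names are illustrative, not taken from the authors' code):

```python
import numpy as np

def cross_attention(q, kv):
    """Plain single-head cross-attention with a residual connection
    (no learned projections; purely illustrative of the enhancement step)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return q + weights @ kv

rng = np.random.default_rng(0)
d = 64
support_queries = rng.normal(size=(5, d))    # from the support branch
query_feats = rng.normal(size=(100, d))      # flattened query feature map
query_queries = rng.normal(size=(20, d))     # query-branch object queries

# "Reference twice": enhance at the feature level, then at the query level.
query_feats = cross_attention(query_feats, support_queries)
query_queries = cross_attention(query_queries, support_queries)
```

Because both enhancement steps reduce to the same simple cross-attention against the support object queries, no hand-crafted spatial correlation module is needed, which is the simplification the abstract claims.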
♻ ☆ Towards Synchronous Memorizability and Generalizability with Site-Modulated Diffusion Replay for Cross-Site Continual Segmentation
The ability to learn sequentially from different data sites is crucial for a
deep network in solving practical medical image diagnosis problems due to
privacy restrictions and storage limitations. However, adapting to an incoming
site leads to catastrophic forgetting of past sites and decreases
generalizability on unseen sites. Existing Continual Learning (CL) and Domain
Generalization (DG) methods have been proposed to solve these two challenges
respectively, but none of them can address both simultaneously. Recognizing
this limitation, this paper proposes a novel training paradigm, learning
towards Synchronous Memorizability and Generalizability (SMG-Learning). To
achieve this, we create the orientational gradient alignment to ensure
memorizability on previous sites, and arbitrary gradient alignment to enhance
generalizability on unseen sites. This approach is named Parallel Gradient
Alignment (PGA). Furthermore, we approximate the PGA as dual meta-objectives
using the first-order Taylor expansion to reduce computational cost of aligning
gradients. Considering that performing gradient alignment, especially for
previous sites, is not feasible due to privacy constraints, we design a
Site-Modulated Diffusion (SMD) model to generate images with site-specific
learnable prompts, replaying images whose data distributions resemble those of previous
sites. We evaluate our method on two medical image segmentation tasks, where
data from different sites arrive sequentially. Experimental results show that
our method enhances both memorizability and generalizability more effectively
than other state-of-the-art methods, delivering satisfactory performance across
all sites. Our code will be available at:
https://github.com/dyxu-cuhkcse/SMG-Learning.
comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible
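The first-order Taylor expansion mentioned in the SMG-Learning abstract is the standard way to turn a gradient-alignment objective into a cheap meta-objective; a generic sketch, with notation chosen here rather than taken from the paper:

```latex
% Generic sites i, j; \alpha is a step size. Symbols are illustrative.
\mathcal{L}_j\bigl(\theta - \alpha\,\nabla_\theta \mathcal{L}_i(\theta)\bigr)
\;\approx\;
\mathcal{L}_j(\theta)
\;-\;
\alpha\,\nabla_\theta \mathcal{L}_i(\theta)\cdot\nabla_\theta \mathcal{L}_j(\theta)
```

Minimizing the left-hand side then both reduces site j's loss and rewards a large inner product between the two sites' gradients (the alignment effect), while requiring only first-order quantities.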
♻ ☆ Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal Texture Synthesis
We present a novel U-Attention vision Transformer for universal texture
synthesis. We exploit the natural long-range dependencies enabled by the
attention mechanism to allow our approach to synthesize diverse textures while
preserving their structures in a single inference. We propose a hierarchical
hourglass backbone that attends to the global structure and performs patch
mapping at varying scales in a coarse-to-fine-to-coarse stream. Completed by
skip connection and convolution designs that propagate and fuse information at
different scales, our hierarchical U-Attention architecture unifies attention
to features from macro structures to micro details, and progressively refines
synthesis results at successive stages. Our method achieves stronger 2$\times$
synthesis than previous work on both stochastic and structured textures while
generalizing to unseen textures without fine-tuning. Ablation studies
demonstrate the effectiveness of each component of our architecture.
♻ ☆ Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views ECCV 2024
Yabo Chen, Jiemin Fang, Yuyang Huang, Taoran Yi, Xiaopeng Zhang, Lingxi Xie, Xinggang Wang, Wenrui Dai, Hongkai Xiong, Qi Tian
Synthesizing multi-view 3D from one single image is a significant but
challenging task. Zero-1-to-3 methods have achieved great success by lifting a
2D latent diffusion model to the 3D scope. The target view image is generated
with a single-view source image and the camera pose as condition information.
However, due to the high sparsity of the single input image, Zero-1-to-3 tends
to produce geometry and appearance inconsistency across views, especially for
complex objects. To tackle this issue, we propose supplying the generation
model with more conditioning information in a self-prompted way. A cascade
framework is constructed with two Zero-1-to-3 models, named Cascade-Zero123,
which progressively extract 3D information from the source image. Specifically,
several nearby views are first generated by the first model and then fed into
the second-stage model along with the source image as generation conditions.
With amplified self-prompted condition images, our Cascade-Zero123 generates
more consistent novel-view images than Zero-1-to-3. Experimental results
demonstrate remarkable improvement, especially for complex and challenging
scenes involving insects, humans, transparent objects, and stacks of multiple
objects. More demos and code are available at
https://cascadezero123.github.io.
comment: ECCV 2024. Project page: https://cascadezero123.github.io/
♻ ☆ GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Junjun He, Yu Qiao
Large Vision-Language Models (LVLMs) are capable of handling diverse data
types such as imaging, text, and physiological signals, and can be applied in
various fields. In the medical field, LVLMs have a high potential to offer
substantial assistance for diagnosis and treatment. Before that, it is crucial
to develop benchmarks to evaluate LVLMs' effectiveness in various medical
applications. Current benchmarks are often built upon specific academic
literature, mainly focusing on a single domain, and lacking varying perceptual
granularities. Thus, they face specific challenges, including limited clinical
relevance, incomplete evaluations, and insufficient guidance for interactive
LVLMs. To address these limitations, we developed GMAI-MMBench, to date the
most comprehensive general medical AI benchmark with a well-categorized data
structure and multiple perceptual granularities. It is constructed from 285 datasets
across 39 medical image modalities, 18 clinical-related tasks, 18 departments,
and 4 perceptual granularities in a Visual Question Answering (VQA) format.
Additionally, we implemented a lexical tree structure that allows users to
customize evaluation tasks, accommodating various assessment needs and
substantially supporting medical AI research and applications. We evaluated 50
LVLMs, and the results show that even the advanced GPT-4o only achieves an
accuracy of 52%, indicating significant room for improvement. Moreover, we
identified five key insufficiencies in current cutting-edge LVLMs that need to
be addressed to advance the development of better medical applications. We
believe that GMAI-MMBench will stimulate the community to build the next
generation of LVLMs toward GMAI.
Project Page: https://uni-medical.github.io/GMAI-MMBench.github.io/
♻ ☆ Tell Me What's Next: Textual Foresight for Generic UI Representations ACL 2024
Mobile app user interfaces (UIs) are rich with action, text, structure, and
image content that can be utilized to learn generic UI representations for
tasks like automating user commands, summarizing content, and evaluating the
accessibility of user interfaces. Prior work has learned strong visual
representations with local or global captioning losses, but fails to retain
both granularities. To combat this, we propose Textual Foresight, a novel
pretraining objective for learning UI screen representations. Textual Foresight
generates global text descriptions of future UI states given a current UI and
local action taken. Our approach requires joint reasoning over elements and
entire screens, resulting in improved UI features: on generation tasks, UI
agents trained with Textual Foresight outperform state-of-the-art by 2% with
28x fewer images. We train with our newly constructed mobile app dataset,
OpenApp, which results in the first public dataset for app UI representation
learning. OpenApp enables new baselines, and we find Textual Foresight improves
average task performance over them by 5.7% while having access to 2x less data.
comment: Accepted to ACL 2024 Findings. Data and code to be released at
https://github.com/aburns4/textualforesight
♻ ☆ GaussianForest: Hierarchical-Hybrid 3D Gaussian Splatting for Compressed Scene Modeling
The field of novel-view synthesis has recently witnessed the emergence of 3D
Gaussian Splatting, which represents scenes in a point-based manner and renders
through rasterization. This methodology, in contrast to Radiance Fields that
rely on ray tracing, demonstrates superior rendering quality and speed.
However, the explicit and unstructured nature of 3D Gaussians poses a
significant storage challenge, impeding its broader application. To address
this challenge, we introduce the Gaussian-Forest modeling framework, which
hierarchically represents a scene as a forest of hybrid 3D Gaussians. Each
hybrid Gaussian retains its unique explicit attributes while sharing implicit
ones with its sibling Gaussians, thus optimizing parameterization with
significantly fewer variables. Moreover, adaptive growth and pruning strategies
are designed, ensuring detailed representation in complex regions and a notable
reduction in the number of required Gaussians. Extensive experiments
demonstrate that Gaussian-Forest not only maintains comparable speed and
quality but also achieves a compression rate surpassing 10 times, marking a
significant advancement in efficient scene modeling. Codes will be available at
https://github.com/Xian-Bei/GaussianForest.
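The explicit/implicit attribute split behind Gaussian-Forest can be sketched with back-of-the-envelope parameter counting (all sizes, including the 48-dim implicit code, are hypothetical, chosen only to illustrate the sharing scheme):

```python
import numpy as np

N_LEAVES = 10_000          # Gaussians in the scene
N_PARENTS = 500            # shared "forest" nodes (illustrative sizes)
EXPLICIT_DIM = 3 + 1       # per-Gaussian position + opacity
IMPLICIT_DIM = 48          # e.g., color/appearance code, shared via parent

# Flat 3DGS: every Gaussian stores every attribute itself.
flat_params = N_LEAVES * (EXPLICIT_DIM + IMPLICIT_DIM)

# Forest: leaves keep explicit attributes plus a parent index; implicit
# attributes live once per parent and are shared by sibling leaves.
parent_of = np.random.default_rng(0).integers(0, N_PARENTS, size=N_LEAVES)
forest_params = N_LEAVES * EXPLICIT_DIM + N_PARENTS * IMPLICIT_DIM
ratio = flat_params / forest_params

# Looking up a leaf's full attribute vector is a cheap indexed gather.
implicit_table = np.zeros((N_PARENTS, IMPLICIT_DIM), dtype=np.float32)
explicit_table = np.zeros((N_LEAVES, EXPLICIT_DIM), dtype=np.float32)

def leaf_attributes(i):
    return np.concatenate([explicit_table[i], implicit_table[parent_of[i]]])
```

With these toy numbers the forest stores roughly 8x fewer parameters than the flat layout, which is the kind of compression the abstract reports (over 10x in the paper's experiments).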
♻ ☆ Single-Point Supervised High-Resolution Dynamic Network for Infrared Small Target Detection
Infrared small target detection (IRSTD) tasks are extremely challenging for
two main reasons: 1) it is difficult to obtain accurate labelling information
that is critical to existing methods, and 2) infrared (IR) small target
information is easily lost in deep networks. To address these issues, we
propose a single-point supervised high-resolution dynamic network (SSHD-Net).
In contrast to existing methods, we achieve state-of-the-art (SOTA) detection
performance using only single-point supervision. Specifically, we first design
a high-resolution cross-feature extraction module (HCEM), which achieves
bi-directional feature interaction through stepped feature cascade channels
(SFCC). It balances network depth and feature resolution to maintain deep IR
small-target information. Secondly, the effective integration of global and
local features is achieved through the dynamic coordinate fusion module (DCFM),
which enhances the anti-interference ability in complex backgrounds. In
addition, we introduce the high-resolution multilevel residual module (HMRM) to
enhance the semantic information extraction capability. Finally, we design the
adaptive target localization detection head (ATLDH) to improve detection
accuracy. Experiments on the publicly available datasets NUDT-SIRST and
IRSTD-1k demonstrate the effectiveness of our method. Compared to other SOTA
methods, our method can achieve better detection performance with only a single
point of supervision.
♻ ☆ FastLGS: Speeding up Language Embedded Gaussians with Feature Grid Mapping
The semantically interactive radiance field has always been an appealing task
for its potential to facilitate user-friendly and automated real-world 3D scene
understanding applications. However, it is challenging to achieve high
quality, efficiency, and zero-shot ability simultaneously for semantics in
radiance fields. In this work, we present FastLGS, an approach that supports
real-time open-vocabulary query within 3D Gaussian Splatting (3DGS) under high
resolution. We propose the semantic feature grid to save multi-view CLIP
features which are extracted based on Segment Anything Model (SAM) masks, and
map the grids to low dimensional features for semantic field training through
3DGS. Once trained, we can restore pixel-aligned CLIP embeddings through
feature grids from rendered features for open-vocabulary queries. Comparisons
with other state-of-the-art methods show that FastLGS achieves first-place
performance in both speed and accuracy: FastLGS is 98x
faster than LERF and 4x faster than LangSplat. Meanwhile, experiments show that
FastLGS is adaptive and compatible with many downstream tasks, such as 3D
segmentation and 3D object inpainting, which can be easily applied to other 3D
manipulation systems.
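The feature-grid idea in FastLGS (training the semantic field on low-dimensional grid coordinates, then restoring full CLIP embeddings by lookup) can be sketched as follows; random vectors stand in for SAM-mask CLIP features, and the grid size and noise level are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 100 SAM masks, each with a 512-d CLIP feature.
clip_feats = rng.normal(size=(100, 512)).astype(np.float32)
clip_feats /= np.linalg.norm(clip_feats, axis=1, keepdims=True)

# Feature grid: each mask gets a cell; the semantic field only has to
# regress a 2-d cell coordinate instead of the 512-d feature itself.
grid_side = 10                     # a 10x10 grid holds up to 100 entries
ids = np.arange(len(clip_feats))
grid_uv = np.stack([ids // grid_side, ids % grid_side], axis=1)  # (100, 2)
low_dim = grid_uv.astype(np.float32) / (grid_side - 1)           # in [0, 1]^2

def restore(rendered):
    """Snap a (possibly noisy) rendered low-dim feature back to its grid
    cell and return the full pixel-aligned CLIP embedding by lookup."""
    uv = np.clip(np.round(rendered * (grid_side - 1)), 0, grid_side - 1)
    uv = uv.astype(int)
    return clip_feats[uv[:, 0] * grid_side + uv[:, 1]]

# Simulate mildly noisy rendered features and recover the embeddings.
noisy = low_dim + rng.normal(scale=0.01, size=low_dim.shape).astype(np.float32)
restored = restore(noisy)
```

The speed advantage comes from this indirection: the per-pixel quantity that has to be rendered and supervised is tiny, while full CLIP embeddings are recovered exactly (up to cell snapping) at query time.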